Computation and Language 115
☆ Theoretical Benefit and Limitation of Diffusion Language Model
Diffusion language models have emerged as a promising approach for text
generation. One would naturally expect this method to be an efficient
replacement for autoregressive models since multiple tokens can be sampled in
parallel during each diffusion step. However, its efficiency-accuracy trade-off
is not yet well understood. In this paper, we present a rigorous theoretical
analysis of a widely used type of diffusion language model, the Masked
Diffusion Model (MDM), and find that its effectiveness heavily depends on the
target evaluation metric. Under mild conditions, we prove that when using
perplexity as the metric, MDMs can achieve near-optimal perplexity with a
number of sampling steps that does not grow with sequence length,
demonstrating that efficiency can be achieved without sacrificing performance.
However, when using the sequence
error rate--which is important for understanding the "correctness" of a
sequence, such as a reasoning chain--we show that the required sampling steps
must scale linearly with sequence length to obtain "correct" sequences, thereby
eliminating MDM's efficiency advantage over autoregressive models. Our analysis
establishes the first theoretical foundation for understanding the benefits and
limitations of MDMs. All theoretical findings are supported by empirical
studies.
comment: 32 pages, 3 figures
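An illustrative way to see the asymmetry described above (our reading, not the
paper's proof): perplexity averages per-token log-losses, so a fixed per-token
error keeps it bounded, while sequence-level correctness compounds with length.

```latex
% Perplexity vs. sequence error rate (SER), illustrative:
\mathrm{PPL}(x_{1:L}) = \exp\!\Big(-\tfrac{1}{L}\sum_{i=1}^{L}\log p(x_i \mid x_{<i})\Big),
\qquad
\mathrm{SER} = 1 - \Pr[\text{all } L \text{ tokens correct}].
% If each token is independently wrong with probability \epsilon, then
% SER = 1 - (1-\epsilon)^L \to 1 as L grows, so keeping SER low forces
% per-token error O(1/L), consistent with sampling steps scaling with L.
```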
☆ MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
Answering questions with Chain-of-Thought (CoT) has significantly enhanced
the reasoning capabilities of Large Language Models (LLMs), yet its impact on
Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
investigation. In this paper, we introduce MME-CoT, a specialized benchmark
evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
science, OCR, logic, space-time, and general scenes. As the first comprehensive
study in this area, we propose a thorough evaluation suite incorporating three
novel metrics that assess the reasoning quality, robustness, and efficiency at
a fine-grained level. Leveraging curated high-quality data and a unique
evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
uncovering several key insights: 1) Models with a reflection mechanism
demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
achieving the highest-quality results; 2) CoT prompting often degrades LMM
performance on perception-heavy tasks, suggesting a potentially harmful
overthinking behavior; and 3) Although the CoT quality is high, LMMs with
reflection exhibit significant inefficiency in both normal response and
self-correction phases. We hope MME-CoT serves as a foundation for advancing
multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
comment: Project Page: https://mmecot.github.io/
☆ Exploring the Potential of Encoder-free Architectures in 3D LMMs
Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Encoder-free architectures have been preliminarily explored in the 2D visual
domain, yet it remains an open question whether they can be effectively applied
to 3D understanding scenarios. In this paper, we present the first
comprehensive investigation into the potential of encoder-free architectures to
overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
These challenges include the failure to adapt to varying point cloud
resolutions and the point features from the encoder not meeting the semantic
needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
We propose the LLM-embedded Semantic Encoding strategy in the pre-training
stage, exploring the effects of various point cloud self-supervised losses, and
present the Hybrid Semantic Loss to extract high-level semantics. 2) We
introduce the Hierarchical Geometry Aggregation strategy in the instruction
tuning stage. This incorporates inductive bias into the LLM's early layers to
focus on the local details of the point clouds. Finally, we present the
first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
classification, captioning, and VQA tasks, respectively. Our results
demonstrate that the encoder-free architecture is highly promising for
replacing encoder-based architectures in the field of 3D understanding. The
code is released at https://github.com/Ivan-Tang-3D/ENEL
comment: The code is released at https://github.com/Ivan-Tang-3D/ENEL
☆ Human-LLM Coevolution: Evidence from Academic Writing
With a statistical analysis of arXiv paper abstracts, we report a marked drop
in the frequency of several words previously identified as overused by ChatGPT,
such as "delve", starting soon after they were pointed out in early 2024. The
frequency of certain other words favored by ChatGPT, such as "significant", has
instead kept increasing. These phenomena suggest that some authors of academic
papers have adapted their use of large language models (LLMs), for example, by
selecting outputs or applying modifications to the LLM-generated content. Such
coevolution and cooperation of humans and LLMs thus introduce additional
challenges to the detection of machine-generated text in real-world scenarios.
Estimating the impact of LLMs on academic writing by examining word frequency
remains feasible, and more attention should be paid to words that were already
frequently employed, including those that have decreased in frequency.
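A minimal sketch of this kind of word-frequency tracking (not the authors'
actual pipeline; the input mapping is a hypothetical structure):

```python
import re

def yearly_word_frequency(abstracts_by_year, word):
    """Fraction of abstracts per year that contain `word` (case-insensitive).

    abstracts_by_year: assumed dict mapping a year to a list of abstract strings.
    """
    pattern = re.compile(rf"\b{re.escape(word)}", re.IGNORECASE)
    return {
        year: sum(bool(pattern.search(a)) for a in abstracts) / max(len(abstracts), 1)
        for year, abstracts in abstracts_by_year.items()
    }

# Comparing the curves for "delve" (dropping after early 2024) against
# "significant" (still rising) would reproduce the contrast reported above.
```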
☆ SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
generate high-quality, fine-grained, sentence-level citations for the
statements in their generated responses. Instead of only relying on costly and
labor-intensive annotations, SelfCite leverages a reward signal provided by the
LLM itself through context ablation: If a citation is necessary, removing the
cited text from the context should prevent the same response; if sufficient,
retaining the cited text alone should preserve the same response. This reward
can guide the inference-time best-of-N sampling strategy to improve citation
quality significantly, as well as be used in preference optimization to
directly fine-tune the models for generating better citations. The
effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3
points on the LongBench-Cite benchmark across five long-form question answering
tasks.
comment: Implementation available at https://github.com/voidism/SelfCite
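A minimal sketch of the context-ablation reward described above, assuming a
hypothetical `lm_logprob` scorer; the paper's exact reward combination may
differ:

```python
def selfcite_style_reward(lm_logprob, context, cited_span, question, response):
    """Score a citation by context ablation (illustrative sketch).

    lm_logprob(context, question, response) is an assumed helper returning the
    LLM's log-probability of `response` given `context` and `question`.
    """
    full = lm_logprob(context, question, response)
    # Necessity: removing the cited text should make the response less likely.
    ablated = lm_logprob(context.replace(cited_span, ""), question, response)
    necessity = full - ablated
    # Sufficiency: the cited text alone should preserve the response.
    alone = lm_logprob(cited_span, question, response)
    sufficiency = alone - full
    # Reward citations that are both necessary and sufficient; this signal can
    # rank best-of-N candidates or build preference pairs for fine-tuning.
    return necessity + sufficiency
```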
☆ CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Chain-of-Thought significantly enhances a model's reasoning capability, but
it also comes with a considerable increase in inference costs due to long
chains. Observing that the reasoning path can be easily compressed for easy
tasks but resists compression for hard tasks, we explore the feasibility of
elastically controlling the length of reasoning paths with only one model,
thereby reducing the inference overhead of reasoning models dynamically based
on task difficulty. We introduce a new tuning and inference strategy named
CoT-Valve, designed to allow models to generate reasoning chains of varying
lengths. To achieve this, we propose to identify a direction in the parameter
space that, when manipulated, can effectively control the length of generated
CoT. Moreover, we show that this property is valuable for compressing the
reasoning chain. We construct datasets with chains from long to short for the
same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
length-compressible CoT tuning method, and (2) a progressive chain length
compression approach. Our experiments show that CoT-Valve successfully enables
controllability and compressibility of the chain and shows better performance
than the prompt-based control. We applied this method to QwQ-32B-Preview,
reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
only one additional incorrect answer.
comment: Work in progress. Code will be released at
https://github.com/horseee/CoT-Valve
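A sketch of the parameter-space idea, under the assumption that the direction
is obtained as a weight difference between short-CoT and long-CoT fine-tunes
(the paper's exact construction may differ):

```python
def apply_length_direction(base_params, delta, alpha):
    """Shift weights along a chain-length-controlling direction (sketch).

    base_params, delta: dicts mapping parameter names to arrays/tensors, where
    `delta` is an assumed direction in parameter space (e.g., short-CoT weights
    minus long-CoT weights). Larger `alpha` pushes generations toward shorter
    reasoning chains; alpha = 0 recovers the base model.
    """
    return {name: base_params[name] + alpha * delta[name]
            for name in base_params}
```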
☆ Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs ICLR 2025
Large Language Models (LLMs) are increasingly used as chatbots, yet their
ability to personalize responses to user preferences remains limited. We
introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
and adhere to user preferences in a long-context conversational setting.
PrefEval comprises 3,000 manually curated user preference and query pairs
spanning 20 topics. PrefEval contains user personalization or preference
information in both explicit and implicit forms, and evaluates LLM performance
using a generation and a classification task. With PrefEval, we evaluate the
preference-following capabilities of 10 open-source and
proprietary LLMs in multi-session conversations with varying context lengths up
to 100k tokens. We benchmark with various prompting, iterative feedback, and
retrieval-augmented generation methods. Our benchmarking effort reveals that
state-of-the-art LLMs face significant challenges in proactively following
users' preferences during conversations. In particular, in zero-shot settings,
preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
across most evaluated models. Even with advanced prompting and retrieval
methods, preference following still deteriorates in long-context conversations.
Furthermore, we show that fine-tuning on PrefEval significantly improves
performance. We believe PrefEval serves as a valuable resource for measuring,
understanding, and enhancing LLMs' preference following abilities, paving the
way for personalized conversational agents. Our code and dataset are available
at https://prefeval.github.io/.
comment: Accepted at ICLR 2025 as oral presentation. Code and data at:
https://prefeval.github.io/
☆ Logical forms complement probability in understanding language model (and human) performance
With the increasing interest in using large language models (LLMs) for
planning in natural language, understanding their behaviors becomes an
important research question. This work conducts a systematic investigation of
LLMs' ability to perform logical reasoning in natural language. We introduce a
controlled dataset of hypothetical and disjunctive syllogisms in propositional
and modal logic and use it as the testbed for understanding LLM performance.
Our results lead to novel insights in predicting LLM behaviors: in addition to
the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
forms should be considered as orthogonal factors. In addition, we show
similarities and differences between the logical reasoning performances of
humans and LLMs by comparing LLM and human behavioral results.
comment: Preprint
☆ Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering
In this study, we tackle industry challenges in video content classification
by exploring and optimizing GPT-based models for zero-shot classification
across seven critical categories of video quality. We contribute a novel
approach to improving GPT's performance through prompt optimization and policy
refinement, demonstrating that simplifying complex policies significantly
reduces false negatives. Additionally, we introduce a new
decomposition-aggregation-based prompt engineering technique, which outperforms
traditional single-prompt methods. These experiments, conducted on real
industry problems, show that thoughtful prompt design can substantially enhance
GPT's performance without additional finetuning, offering an effective and
scalable solution for improving video classification systems across various
domains in industry.
☆ MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing NAACL 2025
We introduce MorphNLI, a modular step-by-step approach to natural language
inference (NLI). When classifying the premise-hypothesis pairs into
{entailment, contradiction, neutral}, we use a language model to generate the
necessary edits to incrementally transform (i.e., morph) the premise into the
hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
progresses with these atomic changes, aggregating these intermediate labels
into a final output. We demonstrate the advantages of our proposed method
particularly in realistic cross-domain settings, where our method always
outperforms strong baselines with improvements up to 12.6% (relative). Further,
our proposed approach is explainable as the atomic edits can be used to
understand the overall NLI label.
comment: 16 pages, 11 figures, 8 tables. Accepted for NAACL 2025 Findings
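A sketch of the morph-then-aggregate pipeline, with hypothetical
`edit_generator` and `nli_model` helpers; the paper's aggregation rule may
differ:

```python
def morphnli_label(edit_generator, nli_model, premise, hypothesis):
    """Stepwise NLI via text morphing (illustrative sketch).

    edit_generator(premise, hypothesis) yields intermediate sentences morphing
    the premise into the hypothesis; nli_model(a, b) returns one of
    "entailment", "contradiction", or "neutral" for an atomic step.
    """
    labels, current = [], premise
    for morphed in edit_generator(premise, hypothesis):
        labels.append(nli_model(current, morphed))
        current = morphed
    # Aggregation rule (an assumption): any contradiction dominates, and
    # entailment holds only if every atomic step is entailed.
    if "contradiction" in labels:
        return "contradiction"
    if labels and all(label == "entailment" for label in labels):
        return "entailment"
    return "neutral"
```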
☆ Zero-shot generation of synthetic neurosurgical data with large language models
Clinical data is fundamental to advance neurosurgical research, but access is
often constrained by data availability, small sample sizes, privacy
regulations, and resource-intensive preprocessing and de-identification
procedures. Synthetic data offers a potential solution to challenges associated
with accessing and using real-world data (RWD). This study aims to evaluate the
capability of zero-shot generation of synthetic neurosurgical data with a large
language model (LLM), GPT-4o, by benchmarking with the conditional tabular
generative adversarial network (CTGAN). Synthetic datasets were compared to
real-world neurosurgical data to assess fidelity (means, proportions,
distributions, and bivariate correlations), utility (ML classifier performance
on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
datasets matched or exceeded CTGAN performance, despite no fine-tuning or
access to RWD for pre-training. Datasets demonstrated high univariate and
bivariate fidelity to RWD without directly exposing any real patient records,
even at amplified sample size. Training an ML classifier on GPT-4o-generated
data and testing on RWD for a binary prediction task showed an F1 score (0.706)
with comparable performance to training on the CTGAN data (0.705) for
predicting postoperative functional status deterioration. GPT-4o demonstrated a
promising ability to generate high-fidelity synthetic neurosurgical data. These
findings also indicate that data synthesized with GPT-4o can effectively
augment clinical data with small sample sizes, and train ML models for
prediction of neurosurgical outcomes. Further investigation is necessary to
improve the preservation of distributional characteristics and boost classifier
performance.
comment: 13 pages, 4 figures, 4 tables
☆ EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
agents offers a promising avenue for tackling real-world tasks. While
language-centric embodied agents have garnered substantial attention,
MLLM-based embodied agents remain underexplored due to the lack of
comprehensive evaluation frameworks. To bridge this gap, we introduce
EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
tasks across four environments, ranging from high-level semantic tasks (e.g.,
household) to low-level tasks involving atomic actions (e.g., navigation and
manipulation); and (2) six meticulously curated subsets evaluating essential
agent capabilities like commonsense reasoning, complex instruction
understanding, spatial awareness, visual perception, and long-term planning.
Through extensive experiments, we evaluated 13 leading proprietary and
open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
at high-level tasks but struggle with low-level manipulation, with the best
model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
multifaceted standardized evaluation platform that not only highlights existing
challenges but also offers valuable insights to advance MLLM-based embodied
agents. Our code is available at https://embodiedbench.github.io.
comment: 51 pages
☆ Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages
Recent advances in generative AI have precipitated a proliferation of novel
writing assistants. These systems typically rely on multilingual large language
models (LLMs), providing globalized workers the ability to revise or create
diverse forms of content in different languages. However, there is substantial
evidence indicating that the performance of multilingual LLMs varies between
languages. Users who employ writing assistance for multiple languages are
therefore susceptible to disparate output quality. Importantly, recent research
has shown that people tend to generalize algorithmic errors across independent
tasks, violating the behavioral axiom of choice independence. In this paper, we
analyze whether user utilization of novel writing assistants in a charity
advertisement writing task is affected by the AI's performance in a second
language. Furthermore, we quantify the extent to which these patterns translate
into the persuasiveness of generated charity advertisements, as well as the
role of people's beliefs about LLM utilization in their donation choices. Our
results provide evidence that writers who engage with an LLM-based writing
assistant violate choice independence, as prior exposure to a Spanish LLM
reduces subsequent utilization of an English LLM. While these patterns do not
affect the aggregate persuasiveness of the generated advertisements, people's
beliefs about the source of an advertisement (human versus AI) do. In
particular, Spanish-speaking female participants who believed that they read an
AI-generated advertisement strongly adjusted their donation behavior downwards.
Furthermore, people are generally not able to adequately differentiate between
human-generated and LLM-generated ads. Our work has important implications for
the design, development, integration, and adoption of multilingual LLMs as
assistive agents -- particularly in writing tasks.
☆ Improve LLM-based Automatic Essay Scoring with Linguistic Features AAAI
Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
grading workload for instructors. Developing a scoring system capable of
handling essays across diverse prompts is challenging due to the flexibility
and diverse nature of the writing task. Existing methods typically fall into
two categories: supervised feature-based approaches and large language model
(LLM)-based methods. Supervised feature-based approaches often achieve higher
performance but require resource-intensive training. In contrast, LLM-based
methods are computationally efficient during inference but tend to suffer from
lower performance. This paper combines these approaches by incorporating
linguistic features into LLM-based scoring. Experimental results show that this
hybrid method outperforms baseline models for both in-domain and out-of-domain
writing prompts.
comment: To be published in the workshop Innovation and Responsibility in
AI-Supported Education (iRaise) at the 2025 Conference on Artificial
Intelligence (AAAI)
☆ Objective quantification of mood states using large language models
Emotional states influence human behaviour and cognition, leading to diverse
thought trajectories. Similarly, Large Language Models (LLMs) showcase an
excellent level of response consistency across wide-ranging contexts (prompts).
We leverage these parallels to establish a framework for quantifying mental
states. Our approach utilises self-report questionnaires that reliably assess
these states due to their inherent sensitivity to patterns of co-occurring
responses. Specifically, we recruited a large sample of participants (N=422) to
investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogeneous set
of depressive mood states measured with participants' open-ended responses to a
depression questionnaire. We show LLM responses to held-out multiple-choice
questions, given participants' open-ended answers, correlate strongly (r:
0.52-0.84) with true questionnaire scores, demonstrating the LLM's generalisation
from mood representations. We explore a link between these representations and
factor analysis. Using ridge regression, we find depression-related subspaces
within LLM hidden states. We show these subspaces to be predictive of
participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
well as suicidality severity. Overall, LLMs can provide quantitative measures
of mental states. The reliability of these hinges upon how informative the
questions we ask participants are. Used correctly, this approach could
supplement mental state assessment in a variety of settings.
comment: main text - 9 pages, 5 figures
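A sketch of the hidden-state probing step described above (the regularization
strength, pooling, and cross-validation setup are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def probe_mood_subspace(hidden_states, factor_scores):
    """Predict questionnaire factor scores from LLM hidden states (sketch).

    hidden_states: (n_participants, d_model) array of pooled LLM activations.
    factor_scores: (n_participants,) array, e.g. a "Depression" factor score.
    """
    preds = cross_val_predict(Ridge(alpha=1.0), hidden_states, factor_scores, cv=5)
    # Correlating predictions with the true scores indicates how much
    # mood-related variance the hidden-state subspace captures.
    return np.corrcoef(preds, factor_scores)[0, 1]
```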
☆ The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models
While reasoning and multilingual capabilities in Language Models (LMs) have
achieved remarkable progress in recent years, their integration into a unified
paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
requires language models to handle logical reasoning across languages while
addressing misalignment, biases, and challenges in low-resource settings. This
survey provides the first in-depth review of multilingual reasoning in LMs. We
give a systematic overview of existing methods that leverage LMs for
multilingual reasoning, specifically outlining the challenges,
motivations, and foundational aspects of applying language models to reason
across diverse languages. We provide an overview of the standard data resources
used for training multilingual reasoning in LMs and the evaluation benchmarks
employed to assess their multilingual capabilities. Next, we analyze various
state-of-the-art methods and their performance on these benchmarks. Finally, we
explore future research opportunities to improve multilingual reasoning in LMs,
focusing on enhancing their ability to handle diverse languages and complex
reasoning tasks.
☆ Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Existing visual perception systems focus on region-level segmentation in
single-turn dialogues, relying on complex and explicit query instructions. Such
systems can neither reason at the pixel level nor comprehend dynamic user
intent that evolves over the interaction. Our work tackles this issue by introducing a
novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
multi-turn conversations, tracking evolving user intent via multi-turn
interactions for fine-grained segmentation. To establish a benchmark for this
novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
multi-turn conversational scenarios with segmentation targets. Building on
PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
Segmentation framework that integrates pixel-level segmentation with robust
multi-turn conversation understanding, generating pixel-grounded explanations
aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in
pixel-level reasoning segmentation. Experimental results on the PRIST dataset
demonstrate that our method outperforms current segmentation-specific baselines
in terms of segmentation and LLM-based reasoning metrics. The code and data are
available at: https://github.com/ccccai239/PixelRIST.
☆ On multi-token prediction for efficient LLM inference
We systematically investigate multi-token prediction (MTP) capabilities
within LLMs pre-trained for next-token prediction (NTP). We first show that
such models inherently possess MTP capabilities via numerical marginalization
over intermediate token probabilities, though performance is data-dependent and
improves with model scale. Furthermore, we explore the challenges of
integrating MTP heads into frozen LLMs and find that their hidden layers are
strongly specialized for NTP, making adaptation non-trivial. Finally, we show
that while joint training of MTP heads with the backbone improves performance,
it cannot fully overcome this barrier, prompting further research in this
direction. Our findings provide a deeper understanding of MTP applied to
pretrained LLMs, informing strategies for accelerating inference through
parallel token prediction.
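A sketch of the numerical marginalization described above: a next-token model's
two-steps-ahead distribution can be estimated by summing over intermediate
tokens, truncated to the top-k for tractability (`logits_fn` is a hypothetical
wrapper):

```python
import torch

def two_token_marginal(logits_fn, prefix, top_k=50):
    """Estimate p(x_{t+2} | prefix) from a next-token-prediction model.

    logits_fn(tokens) is assumed to return a 1-D tensor of next-token logits;
    `prefix` is a list of token ids.
    """
    p_next = torch.softmax(logits_fn(prefix), dim=-1)
    top_p, top_tok = torch.topk(p_next, top_k)
    marginal = torch.zeros_like(p_next)
    for p, tok in zip(top_p, top_tok):
        # p(x_{t+2}) ~= sum over x_{t+1} of p(x_{t+1}) * p(x_{t+2} | x_{t+1})
        marginal += p * torch.softmax(logits_fn(prefix + [tok.item()]), dim=-1)
    return marginal / top_p.sum()  # renormalize over the truncated support
```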
☆ Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
One of the goals of automatic evaluation metrics in grammatical error
correction (GEC) is to rank GEC systems such that it matches human preferences.
However, current automatic evaluations are based on procedures that diverge
from human evaluation. Specifically, human evaluation derives rankings by
aggregating sentence-level relative evaluation results, e.g., pairwise
comparisons, using a rating algorithm, whereas automatic evaluation averages
sentence-level absolute scores to obtain corpus-level scores, which are then
sorted to determine rankings. In this study, we propose an aggregation method
for existing automatic evaluation metrics which aligns with human evaluation
methods to bridge this gap. We conducted experiments using various metrics,
including edit-based metrics, $n$-gram based metrics, and sentence-level
metrics, and show that resolving the gap improves results for most of the
metrics on the SEEDA benchmark. We also find that even BERT-based metrics
sometimes outperform the metrics of GPT-4. We publish our unified
implementation of the metrics and meta-evaluations.
comment: 4 pages, 2 figures
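A minimal sketch of the proposed aggregation direction: instead of averaging
sentence scores into a corpus score, derive rankings from sentence-level
pairwise comparisons (a raw win count stands in here for the rating algorithm,
e.g. a Bradley-Terry-style model):

```python
from itertools import combinations

def rank_by_pairwise_wins(sentence_scores):
    """Rank GEC systems from sentence-level pairwise comparisons (sketch).

    sentence_scores[system] is a list of per-sentence metric scores, aligned
    across systems on the same sentences.
    """
    wins = {system: 0 for system in sentence_scores}
    for a, b in combinations(sentence_scores, 2):
        for score_a, score_b in zip(sentence_scores[a], sentence_scores[b]):
            if score_a > score_b:
                wins[a] += 1
            elif score_b > score_a:
                wins[b] += 1
    return sorted(wins, key=wins.get, reverse=True)
```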
☆ SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
In the rapidly evolving field of Natural Language Processing, Large Language
Models (LLMs) are tasked with increasingly complex reasoning challenges.
Traditional methods like chain-of-thought prompting have shown promise but
often fall short in fully leveraging a model's reasoning capabilities. This
paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
novel prompting technique designed to improve reasoning through a
self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
models to generate and resolve multiple auxiliary questions before tackling the
main query, promoting a more thorough exploration of various aspects of a
topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
across multiple question-answering datasets, demonstrate that SQuARE
significantly surpasses traditional CoT prompts and existing
rephrase-and-respond methods. By systematically decomposing queries, SQuARE
advances LLM capabilities in reasoning tasks. The code is publicly available at
https://github.com/IntelLabs/RAG-FiT/tree/square.
comment: 14 pages
☆ Truth Knows No Language: Evaluating Truthfulness Beyond English
Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
We introduce a professionally translated extension of the TruthfulQA
benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
Spanish. Truthfulness evaluations of large language models (LLMs) have
primarily been conducted in English. However, the ability of LLMs to maintain
truthfulness across languages remains under-explored. Our study evaluates 12
state-of-the-art open LLMs, comparing base and instruction-tuned models using
human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
findings reveal that, while LLMs perform best in English and worst in Basque
(the lowest-resourced language), overall truthfulness discrepancies across
languages are smaller than anticipated. Furthermore, we show that
LLM-as-a-Judge correlates more closely with human judgments than
multiple-choice metrics, and that informativeness plays a critical role in
truthfulness assessment. Our results also indicate that machine translation
provides a viable approach for extending truthfulness benchmarks to additional
languages, offering a scalable alternative to professional translation.
Finally, we observe that universal knowledge questions are better handled
across languages than context- and time-dependent ones, highlighting the need
for truthfulness evaluations that account for cultural and temporal
variability. Dataset and code are publicly available under open licenses.
comment: 13 pages, 5 figures, 8 tables
☆ Language Agents as Digital Representatives in Collective Decision-Making
Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
Consider the process of collective decision-making, in which a group of
individuals interactively select a preferred outcome from among a universe of
alternatives. In this context, "representation" is the activity of making an
individual's preferences present in the process via participation by a proxy
agent -- i.e. their "representative". Here, learned models of human
behavior have the potential to fill this role, with practical implications for
multi-agent scenario studies and mechanism design. In this work, we investigate
the possibility of training \textit{language agents} to behave in the capacity
of representatives of human agents, appropriately expressing the preferences of
those individuals whom they stand for. First, we formalize the setting of
\textit{collective decision-making} -- as the episodic process of interaction
between a group of agents and a decision mechanism. On this basis, we then
formalize the problem of \textit{digital representation} -- as the simulation
of an agent's behavior to yield equivalent outcomes from the mechanism.
Finally, we conduct an empirical case study in the setting of
\textit{consensus-finding} among diverse humans, and demonstrate the
feasibility of fine-tuning large language models to act as digital
representatives.
☆ Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs NAACL
Despite advances in the multilingual capabilities of Large Language Models
(LLMs) across diverse tasks, English remains the dominant language for LLM
research and development. When working with a different language, this has
led to the widespread practice of pre-translation, i.e., translating the task
prompt into English before inference. Selective pre-translation, a more
surgical approach, focuses on translating specific prompt components. However,
its current use is sporadic and lacks a systematic research foundation.
Consequently, the optimal pre-translation strategy for various multilingual
settings and tasks remains unclear. In this work, we aim to uncover the optimal
setup for pre-translation by systematically assessing its use. Specifically, we
view the prompt as a modular entity composed of four functional parts:
instruction, context, examples, and output, each of which may or may not be
translated. We evaluate pre-translation strategies across 35 languages covering
both low and high-resource languages, on various tasks including Question
Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
(NER), and Abstractive Summarization. Our experiments show the impact of
factors such as similarity to English, translation quality, and the size of
pre-training data on model performance with pre-translation. We suggest
practical guidelines for choosing optimal strategies in various multilingual
settings.
comment: Accepted for NAACL findings 2025
☆ A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
Evaluating the open-ended text generation of large language models (LLMs) is
challenging because of the lack of a clear ground truth and the high cost of
human or LLM-based assessments. We propose a novel benchmark that evaluates
LLMs using n-gram statistics and rules, without relying on human judgement or
LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
evaluations while requiring significantly fewer computational resources,
demonstrating its effectiveness as a scalable alternative for assessing LLMs'
open-ended generation capabilities.
comment: 13 pages
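As a flavor of judge-free n-gram scoring (an illustrative stand-in; the
benchmark's Fluency, Truthfulness, and Helpfulness metrics are more elaborate):

```python
def ngram_overlap(candidate, references, n=2):
    """Share of the candidate's n-grams attested in any reference answer."""
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0
    ref = set().union(*(ngrams(r) for r in references))
    return len(cand & ref) / len(cand)
```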
☆ When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
Modern Large Language Models (LLMs) have shown human-like abilities in many
language tasks, sparking interest in comparing LLMs' and humans' language
processing. In this paper, we conduct a detailed comparison of the two on a
sentence comprehension task using garden-path constructions, which are
notoriously challenging for humans. Based on psycholinguistic research, we
formulate hypotheses on why garden-path sentences are hard, and test these
hypotheses on human participants and a large suite of LLMs using comprehension
questions. Our findings reveal that both LLMs and humans struggle with specific
syntactic complexities, with some models showing high correlation with human
comprehension. To complement our findings, we test LLM comprehension of
garden-path constructions with paraphrasing and text-to-image generation tasks,
and find that the results mirror the sentence comprehension question results,
further validating our findings on LLM understanding of these constructions.
☆ SparQLe: Speech Queries to Text Translation Through LLMs
With the growing influence of Large Language Models (LLMs), there is
increasing interest in integrating speech representations with them to enable
more seamless multi-modal processing and speech understanding. This study
introduces a novel approach that leverages self-supervised speech
representations in combination with instruction-tuned LLMs for speech-to-text
translation. The proposed approach uses a modality adapter to align
extracted speech features with instruction-tuned LLMs using English-language
data. Our experiments demonstrate that this method effectively preserves the
semantic content of the input speech and serves as an effective bridge between
self-supervised speech models and instruction-tuned LLMs, offering a promising
solution for various speech understanding applications.
☆ The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics
Joint entity-relation extraction is a critical task in transforming
unstructured or semi-structured text into triplets, facilitating the
construction of large-scale knowledge graphs, and supporting various downstream
applications. Despite its importance, research on Chinese text, particularly
with complex semantics in specialized domains like medicine, remains limited.
To address this gap, we introduce CH-DDI, a Chinese drug-drug interaction
dataset designed to capture the intricacies of medical text. Leveraging the
strengths of attention mechanisms in capturing long-range dependencies, we
propose the SEA module, which enhances the extraction of complex contextual
semantic information, thereby improving entity recognition and relation
extraction. Additionally, to address the inefficiencies of existing methods in
facilitating information exchange between entity recognition and relation
extraction, we present an interactive fusion representation module. This module
employs Cross Attention for bidirectional information exchange between the
tasks and further refines feature extraction through BiLSTM. Experimental
results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
our model exhibits strong generalization capabilities. On the CH-DDI dataset,
our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
relation extraction. On the CoNLL04 dataset, it attains an entity recognition
precision of 89.54% and a relation extraction accuracy of 71.64%.
☆ You Do Not Fully Utilize Transformer's Representation Capacity
In contrast to RNNs, which compress previous tokens into a single hidden
state, Transformers can attend to all previous tokens directly. However,
standard Transformers only use representations from the immediately preceding
layer. In this paper, we show that this design choice causes representation
collapse and leads to suboptimal performance. To address this issue, we
introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
preserves the model's overall memory footprint while expanding its
representational capacity by allowing access to hidden states from earlier
layers. Through extensive experiments across various architectures and
different lookup mechanisms, we demonstrate consistent performance improvements
on a wide range of tasks. Moreover, our analysis of the learned representation
dynamics and our exploration of depthwise circuits reveal how LIMe integrates
information across layers, pointing to promising directions for future
research.
☆ Reliable Conversational Agents under ASP Control that Understand Natural Language
Efforts have been made to make machines converse like humans in the past few
decades. The recent techniques of Large Language Models (LLMs) make it possible
to have human-like conversations with machines, but LLMs' flaws of lacking
understanding and reliability are well documented. We believe that the best way
to eliminate this problem is to use LLMs only as parsers to translate text to
knowledge and vice versa and carry out the conversation by reasoning over this
knowledge using answer set programming (ASP). I have been developing a framework
based on LLMs and ASP to realize reliable chatbots that "understand" human
conversation. This framework has been used to develop task-specific chatbots as
well as socialbots. My future research is focused on making these chatbots
scalable and trainable.
comment: In Proceedings ICLP 2024, arXiv:2502.08453
☆ Answer Set Counting and its Applications
We have focused on Answer Set Programming (ASP), more specifically, answer
set counting, exploring both exact and approximate methodologies. We developed
an exact ASP counter, sharpASP, which utilizes a compact encoding for
propositional formulas, significantly enhancing efficiency compared to existing
methods that often struggle with inefficient encodings. Our evaluations
indicate that sharpASP outperforms current ASP counters on several benchmarks.
In addition, we proposed an approximate ASP counter, named ApproxASP, a
hashing-based counter integrating Gauss-Jordan elimination within the ASP
solver, clingo. As a practical application, we employed ApproxASP for network
reliability estimation, demonstrating superior performance over both
traditional reliability estimators and #SAT-based methods.
comment: In Proceedings ICLP 2024, arXiv:2502.08453
☆ Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles
In this paper, we present a modular system for representing and reasoning
with legal aspects of traffic rules for autonomous vehicles. We focus on a
subset of the United Kingdom's Highway Code (HC) related to junctions. As human
drivers and automated vehicles (AVs) will interact on the roads, especially in
urban environments, we claim that an accessible, unitary, high-level
computational model should exist and be applicable to both users. Autonomous
vehicles introduce a shift in liability that should not bring disadvantages or
increased burden on human drivers. We develop a system "in silico" of the
model. The proposed system is built of three main components: a natural
language interface, using Logical English, which encodes the rules; an internal
representation of the rules in Prolog; and a multi-agent-based simulation
environment, built in NetLogo. The three components interact: Logical English
is translated into and out of Prolog (along with some support code); Prolog and
NetLogo interface via predicates. Such a modular approach enables the different
components to carry different "burdens" in the overall system; it also allows
swapping of modules. Given NetLogo, we can visualize the effect of the modeled
rules as well as validate the system with a simple dynamic running scenario.
Designated agents monitor the behaviour of the vehicles for compliance and
record potential violations where they occur. The information on potential
violations is then utilized by Validators, to determine whether the violation
is punishable, differentiating between exceptions and cases.
comment: In Proceedings ICLP 2024, arXiv:2502.08453
☆ Neuro-Symbolic Contrastive Learning for Cross-domain Inference
Pre-trained language models (PLMs) have made significant advances in natural
language inference (NLI) tasks; however, their sensitivity to textual
perturbations and dependence on large datasets indicate an over-reliance on
shallow heuristics. In contrast, inductive logic programming (ILP) excels at
inferring logical relationships across diverse, sparse and limited datasets,
but its discrete nature requires the inputs to be precisely specified, which
limits its application. This paper proposes a bridge between the two
approaches: neuro-symbolic contrastive learning. This allows for smooth and
differentiable optimisation that improves logical accuracy across an otherwise
discrete, noisy, and sparse topological space of logical functions. We show
that abstract logical relationships can be effectively embedded within a
neuro-symbolic paradigm, by representing data as logic programs and sets of
logic rules. The embedding space captures highly varied textual information
with similar semantic logical relations, but can also separate similar textual
relations that have dissimilar logical relations. Experimental results
demonstrate that our approach significantly improves the inference capabilities
of the models in terms of generalisation and reasoning.
comment: In Proceedings ICLP 2024, arXiv:2502.08453
☆ LP-LM: No Hallucinations in Question Answering with Logic Programming
Large language models (LLMs) are able to generate human-like responses to
user queries. However, LLMs exhibit inherent limitations, especially because
they hallucinate. This paper introduces LP-LM, a system that grounds answers to
questions in known facts contained in a knowledge base (KB), facilitated
through semantic parsing in Prolog, and always produces answers that are
reliable.
LP-LM generates a most probable constituency parse tree along with a
corresponding Prolog term for an input question via Prolog definite clause
grammar (DCG) parsing. The term is then executed against a KB of natural
language sentences also represented as Prolog terms for question answering. By
leveraging DCG and tabling, LP-LM runs in linear time in the size of input
sentences for sufficiently many grammar rules. In experiments comparing
LP-LM with current well-known LLMs on accuracy, we show that LLMs hallucinate
on even simple questions, unlike LP-LM.
comment: In Proceedings ICLP 2024, arXiv:2502.08453
☆ Thinking beyond the anthropomorphic paradigm benefits LLM research
Anthropomorphism, or the attribution of human traits to technology, is an
automatic and unconscious response that occurs even in those with advanced
technical expertise. In this position paper, we analyze hundreds of thousands
of computer science research articles from the past decade and present
empirical evidence of the prevalence and growth of anthropomorphic terminology
in research on large language models (LLMs). This terminology reflects deeper
anthropomorphic conceptualizations which shape how we think about and conduct
LLM research. We argue these conceptualizations may be limiting, and that
challenging them opens up new pathways for understanding and improving LLMs
beyond human analogies. To illustrate this, we identify and analyze five core
anthropomorphic assumptions shaping prominent methodologies across the LLM
development lifecycle, from the assumption that models must use natural
language for reasoning tasks to the assumption that model capabilities should
be evaluated through human-centric benchmarks. For each assumption, we
demonstrate how non-anthropomorphic alternatives can open new directions for
research and development.
☆ Matina: A Large-Scale 73B Token Persian Text Corpus
Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
Text corpora are essential for training models used in tasks like
summarization, translation, and large language models (LLMs). While various
efforts have been made to collect monolingual and multilingual datasets in many
languages, Persian has often been underrepresented due to limited resources for
data collection and preprocessing. Existing Persian datasets are typically
small and lack content diversity, consisting mainly of weblogs and news
articles. This shortage of high-quality, varied data has slowed the development
of NLP models and open-source LLMs for Persian. Since model performance depends
heavily on the quality of training data, we address this gap by introducing the
Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
and deduplicated to ensure high data quality. We further assess its
effectiveness by training and evaluating transformer-based models on key NLP
tasks. Both the dataset and preprocessing codes are publicly available,
enabling researchers to build on and improve this resource for future Persian
NLP advancements.
☆ RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
Code generation has attracted increasing attention with the rise of Large
Language Models (LLMs). Many studies have developed powerful code LLMs by
synthesizing code-related instruction data and applying supervised fine-tuning.
However, these methods are limited by teacher model distillation and ignore the
potential of iterative refinement by self-generated code. In this paper, we
propose Adaptive Critique Refinement (ACR), which enables the model to refine
itself by self-generated code and external critique, rather than directly
imitating the code responses of the teacher model. Concretely, ACR includes a
composite scoring system with LLM-as-a-Judge to evaluate the quality of code
responses and a selective critique strategy with LLM-as-a-Critic to critique
self-generated low-quality code responses. We develop the RefineCoder series by
iteratively applying ACR, achieving continuous performance improvement on
multiple code generation benchmarks. Compared to the baselines of the same
size, our proposed RefineCoder series can achieve comparable or even superior
performance using less data.
comment: work in progress
☆ FLAME: Flexible LLM-Assisted Moderation Engine
Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
The rapid advancement of Large Language Models (LLMs) has introduced
significant challenges in moderating user-model interactions. While LLMs
demonstrate remarkable capabilities, they remain vulnerable to adversarial
attacks, particularly ``jailbreaking'' techniques that bypass content safety
measures. Current content moderation systems, which primarily rely on input
prompt filtering, have proven insufficient, with techniques like Best-of-N
(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
new approach that shifts the focus from input filtering to output moderation.
Unlike traditional circuit-breaking methods that analyze user queries, FLAME
evaluates model responses, offering several key advantages: (1) computational
efficiency in both training and inference, (2) enhanced resistance to BoN
jailbreaking attacks, and (3) flexibility in defining and updating safety
criteria through customizable topic filtering. Our experiments demonstrate that
FLAME significantly outperforms current moderation systems. For example, FLAME
reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
while maintaining low computational overhead. We provide comprehensive
evaluation on various LLMs and analyze the engine's efficiency against the
state-of-the-art jailbreaking. This work contributes to the development of more
robust and adaptable content moderation systems for LLMs.
☆ Musical Heritage Historical Entity Linking
Linking named entities occurring in text to their corresponding entity in a
Knowledge Base (KB) is challenging, especially when dealing with historical
texts. In this work, we introduce Musical Heritage named Entities Recognition,
Classification and Linking (MHERCL), a novel benchmark consisting of manually
annotated sentences extracted from historical periodicals of the music
domain. MHERCL contains named entities under-represented or absent in the most
famous KBs. We experiment with several State-of-the-Art models on the Entity
Linking (EL) task and show that MHERCL is a challenging dataset for all of
them. We propose a novel unsupervised EL model and a method to extend
supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
difficulties posed by historical documents. Our experiments reveal that relying
on unsupervised techniques and improving models with logical constraints based
on KGs and heuristics to predict NIL entities (entities not represented in the
KB of reference) results in better EL performance on historical documents.
comment: To appear in Artificial Intelligence Review Journal
☆ Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs
Objectives: Large language models (LLMs) can harness medical knowledge for
intelligent question answering (Q&A), promising support for auxiliary diagnosis
and medical talent cultivation. However, there is a deficiency of highly
efficient retrieval-augmented generation (RAG) frameworks within the domain of
Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
tasks.
Materials and Methods: We introduce the novel approach of knowledge
organization, constructing a tree structure knowledge base with hierarchy. At
inference time, our self-reflection framework retrieves from this knowledge
base, integrating information across chapters. Questions from the TCM Medical
Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
randomly selected as benchmark datasets.
Results: By coupling with GPT-4, the framework can improve the best
performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
the framework improves a total of 18.52 points across dimensions of safety,
consistency, explainability, compliance, and coherence.
Conclusion: The TOSRR framework can effectively improve LLM's capability in
Q&A tasks of TCM.
☆ A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions
Arabic is one of the oldest languages still in use today. As a result,
several Arabic-speaking regions have developed dialects that are unique to
them. Dialect and emotion recognition have various uses in Arabic text
analysis, such as determining an online customer's origin based on their
comments. Furthermore, intelligent chatbots that are aware of a user's emotions
can respond appropriately to the user. Current research in emotion detection in
the Arabic language lacks awareness of how emotions are exhibited in different
dialects, which motivates the work found in this study. This research addresses
the problems of dialect and emotion classification in Arabic. Specifically,
this is achieved by building a novel framework that can identify and predict
Arabic dialects and emotions from a given text. The framework consists of three
modules: A text-preprocessing module, a classification module, and a clustering
module with the novel capability of building new dialect-aware emotion
lexicons. The proposed framework generated a new emotional lexicon for
different dialects. It achieved an accuracy of 88.9% in classifying Arabic
dialects, which outperforms the state-of-the-art results by 6.45 percentage
points. Furthermore, the framework achieved accuracies of 89.1% and 79% in
detecting emotions in the Egyptian and Gulf dialects, respectively.
☆ The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)
This study explored how Vision-Language Models (VLMs) process ignorance
implicatures with visual and linguistic cues. Particularly, we focused on the
effects of contexts (precise and approximate contexts) and modifier types (bare
numerals, superlative, and comparative modifiers), which were considered
pragmatic and semantic factors respectively. Methodologically, we conducted a
truth-value judgment task in visually grounded settings using GPT-4o and Gemini
1.5 Pro. The results indicate that while both models exhibited sensitivity to
linguistic cues (modifier), they failed to process ignorance implicatures with
visual cues (context) as humans do. Specifically, the influence of context was
weaker and inconsistent across models, indicating challenges in pragmatic
reasoning for VLMs. On the other hand, superlative modifiers were more strongly
associated with ignorance implicatures as compared to comparative modifiers,
supporting the semantic view. These findings highlight the need for further
advancements in VLMs to process language-vision information in a
context-dependent way to achieve human-like pragmatic inference.
comment: 13 pages, 3 figures, 3 tables
☆ Logical Reasoning in Large Language Models: A Survey
With the emergence of advanced reasoning models like OpenAI o3 and
DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
reasoning capabilities. However, their ability to perform rigorous logical
reasoning remains an open question. This survey synthesizes recent advancements
in logical reasoning within LLMs, a critical area of AI research. It outlines
the scope of logical reasoning in LLMs, its theoretical foundations, and the
benchmarks used to evaluate reasoning proficiency. We analyze existing
capabilities across different reasoning paradigms - deductive, inductive,
abductive, and analogical - and assess strategies to enhance reasoning
performance, including data-centric tuning, reinforcement learning, decoding
strategies, and neuro-symbolic approaches. The review concludes with future
directions, emphasizing the need for further exploration to strengthen logical
reasoning in AI systems.
☆ A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit
In this paper, we propose an optimized Transformer model that integrates
Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
apply it to fake news classification for the first time. First, we employ the
TF-IDF method to extract features from news texts and transform them into
numeric representations to facilitate subsequent machine learning tasks. Two
sets of experiments are then conducted for fake news detection and
classification: one using a Transformer model optimized only with BiGRU, and
the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
Experimental results show that the BiGRU-optimized Transformer achieves 100%
accuracy on the training set and 99.67% on the test set, while the addition of
the Bayesian algorithm maintains 100% accuracy on the training set and slightly
improves test-set accuracy to 99.73%. This indicates that the Bayesian
algorithm boosts model accuracy by 0.06%, further enhancing the detection
capability for fake news. Moreover, the proposed algorithm converges rapidly at
around the 10th training epoch with accuracy nearing 100%, demonstrating both
its effectiveness and its fast classification ability. Overall, the optimized
Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
excellent continuous learning and detection performance, offering a robust
technical means to combat the spread of fake news in the current era of
information overload.
comment: 6 pages, 7 figures
☆ A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning
With the continuous development of natural language processing (NLP)
technology, text classification tasks have been widely used in multiple
application fields. However, obtaining labeled data is often expensive and
difficult, especially in few-shot learning scenarios. To solve this problem,
this paper proposes a few-shot text classification model based on transfer
learning and meta-learning. The model uses the knowledge of the pre-trained
model for transfer and optimizes the model's rapid adaptability in few-sample
tasks through a meta-learning mechanism. Through a series of comparative
experiments and ablation experiments, we verified the effectiveness of the
proposed method. The experimental results show that under the conditions of few
samples and medium samples, the model based on transfer learning and
meta-learning significantly outperforms traditional machine learning and deep
learning methods. In addition, ablation experiments further analyzed the
contribution of each component to the model performance and confirmed the key
role of transfer learning and meta-learning in improving model accuracy.
Finally, this paper discusses future research directions and looks forward to
the potential of this method in practical applications.
☆ Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking
The pervasiveness of large language models and generative AI in online media
has amplified the need for effective automated fact-checking to assist
fact-checkers in tackling the increasing volume and sophistication of
misinformation. The complex nature of fact-checking demands that automated
fact-checking systems provide explanations that enable fact-checkers to
scrutinise their outputs. However, it is unclear how these explanations should
align with the decision-making and reasoning processes of fact-checkers to be
effectively integrated into their workflows. Through semi-structured interviews
with fact-checking professionals, we bridge this gap by: (i) providing an
account of how fact-checkers assess evidence, make decisions, and explain their
processes; (ii) examining how fact-checkers use automated tools in practice;
and (iii) identifying fact-checker explanation requirements for automated
fact-checking tools. The findings show unmet explanation needs and identify
important criteria for replicable fact-checking explanations that trace the
model's reasoning path, reference specific evidence, and highlight uncertainty
and information gaps.
comment: Conditionally accepted to CHI'25
☆ CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
Role-playing language agents (RPLAs) have emerged as promising applications
of large language models (LLMs). However, simulating established characters
presents a challenging task for RPLAs, due to the lack of authentic character
datasets and nuanced evaluation methods using such data. In this paper, we
present CoSER, comprising a high-quality dataset, open models, and an
evaluation protocol towards effective RPLAs of established characters. The
CoSER dataset covers 17,966 characters from 771 renowned books. It provides
authentic dialogues with real-world intricacies, as well as diverse data types
such as conversation setups, character experiences and internal thoughts.
Drawing from acting methodology, we introduce given-circumstance acting for
training and evaluating role-playing LLMs, where LLMs sequentially portray
multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
Extensive experiments demonstrate the value of the CoSER dataset for RPLA
training, evaluation and retrieval. Moreover, CoSER 70B exhibits
state-of-the-art performance surpassing or matching GPT-4o on our evaluation
and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
the InCharacter and LifeChoice benchmarks respectively.
☆ Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables
Retrieval-augmented generation (RAG) is a key technique for leveraging
external knowledge and reducing hallucinations in large language models (LLMs).
However, RAG still struggles to fully prevent hallucinated responses. To
address this, it is essential to identify samples prone to hallucination or
guide LLMs toward correct responses, which experts then annotate to develop
high-quality datasets for refining LLMs. However, the growing scarcity of such
datasets makes their creation challenging. This paper proposes using the vast
amount of conversations from widespread LLM usage to build these datasets,
training LLMs to avoid hallucination-prone questions while accurately
responding to manageable ones. Given the impracticality of expert-annotating
all conversation records, the paper introduces AL4RAG, which uses active
learning to select the most suitable conversation samples for annotation,
optimizing performance within an annotation budget. Additionally, recognizing
that traditional active learning methods are not fully compatible with RAG due
to unsuitable distance metrics, we develop a novel sample distance measurement
for RAG active learning. Extensive experiments show that our method
consistently outperforms baselines across multiple metrics.
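To make the acquisition step concrete, here is a minimal sketch of
diversity-driven sample selection over conversation embeddings. It is only an
illustration: the paper's RAG-specific distance metric is not reproduced here,
so plain Euclidean farthest-point selection stands in for it, and `embeddings`
and `budget` are assumed inputs.

    import numpy as np

    def select_for_annotation(embeddings, budget):
        """Pick `budget` mutually distant conversation records to annotate.
        Farthest-point selection under Euclidean distance; the paper's
        RAG-tailored distance would replace the norm below."""
        chosen = [int(np.argmax(np.linalg.norm(embeddings, axis=1)))]  # seed
        while len(chosen) < budget:
            # distance from every record to its nearest already-chosen record
            diffs = embeddings[:, None, :] - embeddings[chosen][None, :, :]
            d = np.linalg.norm(diffs, axis=-1).min(axis=1)
            chosen.append(int(np.argmax(d)))  # farthest from the chosen set
        return chosen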
☆ An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
This paper investigates data selection and model merging methodologies aimed
at incorporating advanced reasoning capabilities such as those of DeepSeek R1
into language-specific large language models (LLMs), with a particular focus on
the Thai LLM. Our goal is to enhance the reasoning capabilities of
language-specific LLMs while maintaining their target language abilities.
DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
such as English and Chinese. However, low-resource languages remain underserved
due to the dominance of English-centric training data and model optimizations,
which limit performance in these languages. This limitation results in
unreliable code-switching and diminished effectiveness on tasks in low-resource
languages. Meanwhile, local and regional LLM initiatives have attempted to
bridge this gap by developing language-specific LLMs that focus on improving
local linguistic fidelity. We demonstrate that, with only publicly available
datasets and a computational budget of $120, it is possible to enhance the
reasoning capabilities of language-specific LLMs to match the level of DeepSeek
R1, without compromising their performance on target language tasks.
comment: 9 pages
☆ Typhoon T1: An Open Thai Reasoning Model
This paper introduces Typhoon T1, an open effort to develop an open Thai
reasoning model. A reasoning model is a relatively new type of generative model
built on top of large language models (LLMs). A reasoning model generates a
long chain of thought before arriving at a final answer, an approach found to
improve performance on complex tasks. However, details on developing such a
model are limited, especially for reasoning models that can generate traces in
a low-resource language. Typhoon T1 presents an open effort that dives into the
details of developing a reasoning model in a more cost-effective way by
leveraging supervised fine-tuning using open datasets, instead of reinforcement
learning. This paper shares the details about synthetic data generation and
training, as well as our dataset and model weights. Additionally, we provide
insights gained from developing a reasoning model that generalizes across
domains and is capable of generating reasoning traces in a low-resource
language, using Thai as an example. We hope this open effort provides a
foundation for further research in this field.
comment: 25 pages, 6 figures
☆ Diversity Enhances an LLM's Performance in RAG and Long-context Task
The rapid advancements in large language models (LLMs) have highlighted the
challenge of context window limitations, primarily due to the quadratic time
complexity of the self-attention mechanism ($O(N^2)$, where $N$ denotes the
context window length). This constraint impacts tasks such as
retrieval-augmented generation (RAG) in question answering (Q&A) and long
context summarization. A common approach involves selecting content with the
highest similarity to the query; however, this often leads to redundancy and
the exclusion of diverse yet relevant information. Building on principles from
Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
integrate diversity into the content selection process. Our findings reveal
that incorporating diversity substantially increases the recall of selecting
relevant sentences or chunks before LLM-based Q&A and summarization. These
results highlight the importance of maintaining diversity in future LLM
applications to further improve summarization and Q&A outcomes.
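The MMR principle the abstract builds on is easy to sketch. Below is a minimal
greedy selector over precomputed embeddings; the trade-off weight `lam` and
cosine similarity are assumptions, and the paper's FPS-based variant is not
shown.

    import numpy as np

    def mmr_select(query_vec, chunk_vecs, k, lam=0.5):
        """Greedily pick k chunks, trading query relevance against redundancy."""
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        candidates, selected = list(range(len(chunk_vecs))), []
        while candidates and len(selected) < k:
            def mmr(i):
                rel = cos(query_vec, chunk_vecs[i])  # similarity to the query
                red = max((cos(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                          default=0.0)               # similarity to chosen chunks
                return lam * rel - (1 - lam) * red
            best = max(candidates, key=mmr)
            selected.append(best)
            candidates.remove(best)
        return selected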
☆ Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech
This paper makes three contributions. First, via a substantial corpus of
1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
focus both on positive and negative content. In particular, we construct a
fine-grained hope speech classifier that detects positive (hope speech),
negative, neutral, and irrelevant content. Second, in consultation with a
public health expert specializing in LGBTQ+ health, we conduct an annotation
study with a balanced and diverse political representation and release a
dataset of 3,750 instances with fine-grained labels and detailed annotator
demographic information. Finally, beyond providing a vital resource for the
LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
reveal (1) strong association between rater political beliefs and how they rate
content relevant to a marginalized community; (2) models trained on individual
political beliefs exhibit considerable in-the-wild disagreement; and (3)
zero-shot large language models (LLMs) align more with liberal raters.
☆ Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning NAACL 2025
Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
Language models are aligned to the collective voice of many, resulting in
generic outputs that do not align with specific users' styles. In this work, we
present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
that personalizes language models for text generation tasks with fewer than 10
examples per user. TICL iteratively expands an in-context learning prompt via a
trial-error-explain process, adding model-generated negative samples and
explanations that provide fine-grained guidance towards a specific user's
style. TICL achieves favorable win rates on pairwise comparisons with
LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
outperforms competitive tuning-free baselines for personalized alignment tasks
of writing emails, essays and news articles. Both lexical and qualitative
analyses show that the negative samples and explanations enable language models
to learn stylistic context more effectively and overcome the bias towards
structural and formal phrases observed in their zero-shot outputs. By
front-loading inference compute to create a user-specific in-context learning
prompt that does not require extra generation steps at test time, TICL presents
a novel yet simple approach for personalized alignment.
comment: NAACL 2025 Findings
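The trial-error-explain loop can be sketched as follows; the three callables
are hypothetical stand-ins for LLM calls (generation, a style judge, and a
critique step), not functions from the paper's code.

    from typing import Callable, List

    def ticl_build_prompt(generate: Callable[[str, str], str],
                          is_on_style: Callable[[str], bool],
                          explain_gap: Callable[[str], str],
                          seed_prompt: str,
                          task_inputs: List[str],
                          rounds: int = 3) -> str:
        """Grow an in-context prompt with negative samples plus explanations."""
        prompt = seed_prompt
        for _ in range(rounds):
            for x in task_inputs:
                trial = generate(prompt, x)          # trial
                if not is_on_style(trial):           # error
                    reason = explain_gap(trial)      # explain
                    prompt += (f"\nInput: {x}\nRejected output: {trial}"
                               f"\nWhy it misses the user's style: {reason}")
        return prompt  # reused verbatim at test time: no extra generation steps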
☆ Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning
Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
The deployment of Large Language Models (LLM) on mobile devices offers
significant potential for medical applications, enhancing privacy, security,
and cost-efficiency by eliminating reliance on cloud-based services and keeping
sensitive health data local. However, the performance and accuracy of on-device
LLMs in real-world medical contexts remain underexplored. In this study, we
benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
accuracy, computational efficiency, and thermal limitation across various
mobile devices. Our results indicate that compact general-purpose models like
Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
deploying LLMs on older devices remains feasible, with memory constraints
posing a greater challenge than raw processing power. Our study underscores the
potential of on-device LLMs for healthcare while emphasizing the need for more
efficient inference and models tailored to real-world clinical reasoning.
☆ Structured Convergence in Large Language Model Representations via Hierarchical Latent Space Folding
Fenella Harcourt, Naderdel Piero, Gilbert Sutherland, Daphne Holloway, Harriet Bracknell, Julian Ormsby
Token representations in high-dimensional latent spaces often exhibit
redundancy, limiting computational efficiency and reducing structural coherence
across model layers. Hierarchical latent space folding introduces a structured
transformation mechanism that enforces a multi-scale organization within
learned embeddings, refining representational compactness while preserving
essential contextual distinctions. The proposed approach incorporates dynamic
folding operations that iteratively adjust token embeddings through structured
transformations, influencing both short-range and long-range dependencies in
sequential processing tasks. Empirical evaluation demonstrates a reduction in
representational variance across layers, contributing to more stable perplexity
distributions and enhancing predictive confidence in text generation. The
structured redistribution of attention head utilization leads to more efficient
allocation of computational resources, particularly in deeper layers, where
hierarchical refinements improve contextual abstraction. Comparative analysis
of activation sparsity patterns suggests that hierarchical adjustments
selectively reinforce critical pathways while reducing computational overhead
in non-essential regions of the model. Statistical assessments of token
reordering frequencies reveal that hierarchical modifications introduce subtle
shifts in sequential dependencies, improving contextual alignment while
maintaining syntactic correctness. Computational trade-offs associated with
hierarchical folding introduce marginal increases in training time per epoch,
yet empirical findings indicate that inference efficiency benefits from the
structured representation adjustments. The results highlight the impact of
hierarchical latent space folding on optimizing model performance through
improved representation structuring and computational efficiency.
☆ The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding NAACL 2025
In a systematic way, we investigate a widely asked question: do LLMs really
understand what they say? This question relates to the more familiar term
Stochastic Parrot. To this end, we propose a summative assessment over a
carefully
designed physical concept understanding task, PhysiCo. Our task alleviates the
memorization issue via the usage of grid-format inputs that abstractly describe
physical phenomena. The grids represent varying levels of understanding, from
the core phenomenon and application examples to analogies with other abstract
patterns in the grid world. A comprehensive study on our task demonstrates: (1)
state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag
behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs,
as they fail on our grid task but can describe and recognize the same concepts
well in natural language; (3) our task challenges the LLMs due to intrinsic
difficulties rather than the unfamiliar grid format, as in-context learning and
fine-tuning on identically formatted data added little to their performance.
comment: NAACL 2025 Main Conference. First 5 authors contributed equally.
Project page: https://physico-benchmark.github.io/
☆ Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis
Large language models (LLMs) have demonstrated significant utility in
real-world applications, exhibiting impressive capabilities in natural language
processing and understanding. Benchmark evaluations are crucial for assessing
the capabilities of LLMs as they can provide a comprehensive assessment of
their strengths and weaknesses. However, current evaluation methods often
overlook the inherent randomness of LLMs by employing deterministic generation
strategies or relying on a single random sample, resulting in unaccounted
sampling variance and unreliable benchmark score estimates. In this paper, we
propose a hierarchical statistical model that provides a more comprehensive
representation of the benchmarking process by incorporating both benchmark
characteristics and LLM randomness. We show that leveraging multiple
generations improves the accuracy of estimating the benchmark score and reduces
variance. We also introduce $\mathbb P\left(\text{correct}\right)$, a
prompt-level difficulty score based on correct ratios, providing fine-grained
insights into individual prompts. Additionally, we create a data map that
visualizes difficulty and semantic prompts, enabling error detection and
quality control in benchmark construction.
comment: 10 pages, 1 table, 4 Figures
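The prompt-level difficulty score reduces to a correct ratio over repeated
samples, roughly as in this sketch; `sample_and_check` is a hypothetical
callable that draws one stochastic generation and compares it against the
reference answer.

    def prompt_difficulty(sample_and_check, prompts, n_gens=16):
        """Estimate per-prompt P(correct) from n_gens stochastic generations."""
        scores = {}
        for p in prompts:
            correct = sum(sample_and_check(p) for _ in range(n_gens))
            scores[p] = correct / n_gens  # correct ratio estimates P(correct)
        return scores

The sampling variance of each per-prompt estimate shrinks roughly as
p(1 - p) / n_gens, which is why multiple generations stabilize the benchmark
score.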
☆ Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Synthetically-generated data plays an increasingly large role in training
large language models. However, while synthetic data has been found to be
useful, studies have also shown that without proper curation it can cause LLM
performance to plateau, or even "collapse", after many training iterations. In
this paper, we formalize this question and develop a theoretical framework to
investigate how much curation is needed in order to ensure that LLM performance
continually improves. We find that the requirements are nearly minimal. We
describe a training procedure that converges to an optimal LLM even if almost
all of the non-synthetic training data is of poor quality. Our analysis is
inspired by boosting, a classic machine learning technique that leverages a
very weak learning algorithm to produce an arbitrarily good classifier. Our
training procedure subsumes many recently proposed methods for training LLMs on
synthetic data, and thus our analysis sheds light on why they are successful,
and also suggests opportunities for future improvement. We present experiments
that validate our theory, and show that dynamically focusing labeling resources
on the most challenging examples -- in much the same way that boosting focuses
the efforts of the weak learner -- leads to improved performance.
☆ CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality
We introduce CopySpec, an innovative technique designed to tackle the
inefficiencies LLMs face when generating responses that closely resemble
previous outputs. CopySpec identifies repeated sequences in the model's chat
history and speculates that the same tokens will follow, enabling seamless
copying without compromising output quality or requiring additional GPU memory.
To evaluate the effectiveness of our approach, we conducted experiments using
five LLMs and five datasets: MT-Bench, CNN/DM, GSM-8K, HumanEval, and our newly
created dataset, MT-Redundant. MT-Redundant, introduced in this paper,
transforms the second turn of MT-Bench into a request for variations of the
first turn's answer, simulating real-world scenarios where users request
modifications to prior responses. Our results demonstrate significant
speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select
MT-Redundant categories, and 2.66x on the third turn of GSM-8K's
self-correction tasks. Moreover, we show that CopySpec integrates seamlessly
with speculative decoding, yielding an average 49% additional speed-up over
speculative decoding for the second turn of MT-Redundant across all eight
categories. While LLMs, even with speculative decoding, suffer from slower
inference as context sizes grow, CopySpec leverages the expanded context to
accelerate inference, making it faster as the context size increases. Our code
and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
comment: 33 pages, 18 figures, 19 tables
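The core copy speculation can be sketched in a few lines: find the most recent
earlier occurrence of the current suffix and propose the tokens that followed
it as draft tokens. The window sizes here are assumptions, and the verification
pass that accepts or rejects the draft (as in speculative decoding) is omitted.

    def speculate_copy(context_ids, lookup_len=8, max_copy=32):
        """Propose draft tokens by copying what followed a repeated suffix."""
        suffix = tuple(context_ids[-lookup_len:])
        # scan backwards for the most recent earlier match of the suffix
        for start in range(len(context_ids) - lookup_len - 1, -1, -1):
            if tuple(context_ids[start:start + lookup_len]) == suffix:
                end = start + lookup_len
                return context_ids[end:end + max_copy]  # draft tokens to verify
        return []  # no repeat found: fall back to ordinary decoding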
☆ PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O. Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G. Elmore, Ranjay Krishna, Linda Shapiro
Diagnosing diseases through histopathology whole slide images (WSIs) is
fundamental in modern pathology but is challenged by the gigapixel scale and
complexity of WSIs. Trained histopathologists overcome this challenge by
navigating the WSI, looking for relevant patches, taking notes, and compiling
them to produce a final holistic diagnostic. Traditional AI approaches, such as
multiple instance learning and transformer-based models, fall short of such a
holistic, iterative, multi-scale diagnostic procedure, limiting their adoption
in the real world. We introduce PathFinder, a multi-modal, multi-agent
framework that emulates the decision-making process of expert pathologists.
PathFinder integrates four AI agents, the Triage Agent, Navigation Agent,
Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs,
gather evidence, and provide comprehensive diagnoses with natural language
explanations. The Triage Agent classifies the WSI as benign or risky; if risky,
the Navigation and Description Agents iteratively focus on significant regions,
generating importance maps and descriptive insights of sampled patches.
Finally, the Diagnosis Agent synthesizes the findings to determine the
patient's diagnostic classification. Our experiments show that PathFinder
outperforms state-of-the-art methods in skin melanoma diagnosis by 8% while
offering inherent explainability through natural language descriptions of
diagnostically relevant patches. Qualitative analysis by pathologists shows
that the Description Agent's outputs are of high quality and comparable to
GPT-4o. PathFinder is also the first AI-based system to surpass the average
performance of pathologists in this challenging melanoma classification task by
9%, setting a new record for efficient, accurate, and interpretable AI-assisted
diagnostics in pathology. Data, code and models available at
https://pathfinder-dx.github.io/
☆ InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
In modern large language models (LLMs), handling very long context lengths
presents significant challenges as it causes slower inference speeds and
increased memory costs. Additionally, most existing pre-trained LLMs fail to
generalize beyond their original training sequence lengths. To enable efficient
and practical long-context utilization, we introduce InfiniteHiP, a novel and
practical LLM inference framework that accelerates processing by dynamically
eliminating irrelevant context tokens through a modular hierarchical token
pruning algorithm. Our method also allows generalization to longer sequences by
selectively applying various RoPE adjustment methods according to the internal
attention patterns within LLMs. Furthermore, we offload the key-value cache to
host memory during inference, significantly reducing GPU memory pressure. As a
result, InfiniteHiP enables the processing of up to 3 million tokens on a
single L40s 48GB GPU -- 3x larger -- without any permanent loss of context
information. Our framework achieves an 18.95x speedup in attention decoding for
a 1 million token context without requiring additional training. We implement
our method in the SGLang framework and demonstrate its effectiveness and
practicality through extensive evaluations.
comment: 21 pages
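As a toy illustration of hierarchical token pruning, one can score coarse KV
blocks against the current query and keep only the top fraction; the paper's
modular multi-stage algorithm, RoPE adjustments, and KV offloading are far
more involved than this sketch, and `keep_ratio` is an assumed knob.

    import torch

    def prune_context_blocks(q, k_blocks, keep_ratio=0.25):
        """Keep the KV blocks whose summary key scores highest against q."""
        block_keys = torch.stack([kb.mean(dim=0) for kb in k_blocks])  # (B, D)
        scores = block_keys @ q                  # coarse relevance per block
        n_keep = max(1, int(keep_ratio * len(k_blocks)))
        keep = scores.topk(n_keep).indices
        return [k_blocks[i] for i in keep.tolist()]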
☆ Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs
Fact-checking is necessary to address the increasing volume of
misinformation. Traditional fact-checking relies on manual analysis to verify
claims, but it is slow and resource-intensive. This study establishes baseline
comparisons for Automated Fact-Checking (AFC) using Large Language Models
(LLMs) across multiple labeling schemes (binary, three-class, five-class) and
extends traditional claim verification by incorporating analysis, verdict
classification, and explanation in a structured setup to provide comprehensive
justifications for real-world claims. We evaluate Llama-3 models of varying
sizes (3B, 8B, 70B) on 17,856 claims collected from PolitiFact (2007-2024)
using evidence retrieved via restricted web searches. We utilize TIGERScore as
a reference-free evaluation metric to score the justifications. Our results
show that larger LLMs consistently outperform smaller LLMs in classification
accuracy and justification quality without fine-tuning. We find that smaller
LLMs in a one-shot scenario provide comparable task performance to fine-tuned
Small Language Models (SLMs) with large context sizes, while larger LLMs
consistently surpass them. Evidence integration improves performance across all
models, with larger LLMs benefiting most. Distinguishing between nuanced labels
remains challenging, emphasizing the need for further exploration of labeling
schemes and alignment with evidence. Our findings demonstrate the potential of
retrieval-augmented AFC with LLMs.
☆ Can Uniform Meaning Representation Help GPT-4 Translate from Indigenous Languages?
While ChatGPT and GPT-based models are able to effectively perform many tasks
without additional fine-tuning, they struggle with tasks related to extremely
low-resource and indigenous languages. Uniform Meaning Representation
(UMR), a semantic representation designed to capture the meaning of texts in
many languages, is well-poised to be leveraged in the development of
low-resource language technologies. In this work, we explore the downstream
technical utility of UMR for low-resource languages by incorporating it into
GPT-4 prompts. Specifically, we examine the ability of GPT-4 to perform
translation from three indigenous languages (Navajo, Arápaho, and Kukama),
with and without demonstrations, as well as with and without UMR annotations.
Ultimately we find that in the majority of our test cases, integrating UMR into
the prompt results in a statistically significant increase in performance,
which is a promising indication of future applications of the UMR formalism.
☆ Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication NAACL 2025
Weicheng Ma, Hefan Zhang, Ivory Yang, Shiyu Ji, Joice Chen, Farnoosh Hashemi, Shubham Mohole, Ethan Gearey, Michael Macy, Saeed Hassanpour, Soroush Vosoughi
Large Language Models (LLMs) have shown proficiency in generating persuasive
dialogue, yet concerns about the fluency and sophistication of their outputs
persist. This paper presents a multi-LLM communication framework designed to
enhance the generation of persuasive data automatically. This framework
facilitates the efficient production of high-quality, diverse linguistic
content with minimal human oversight. Through extensive evaluations, we
demonstrate that the generated data excels in naturalness, linguistic
diversity, and the strategic use of persuasion, even in complex scenarios
involving social taboos. The framework also proves adept at generalizing across
novel contexts. Our results highlight the framework's potential to
significantly advance research in both computational and social science domains
concerning persuasive communication.
comment: Accepted to NAACL 2025 Main Conference
☆ LLM-Enhanced Multiple Instance Learning for Joint Rumor and Stance Detection with Social Context Information
The proliferation of misinformation, such as rumors on social media, has
drawn significant attention, prompting various expressions of stance among
users. Although rumor detection and stance detection are distinct tasks, they
can complement each other. Rumors can be identified by cross-referencing
stances in related posts, and stances are influenced by the nature of the
rumor. However, existing stance detection methods often require post-level
stance annotations, which are costly to obtain. We propose a novel
LLM-enhanced multiple instance learning (MIL) approach to jointly predict
post stance and claim class labels, supervised solely by claim labels, using
an undirected microblog propagation model. Our weakly supervised approach
relies only on bag-level labels of claim veracity, aligning with MIL
principles. To achieve this, we
transform the multi-class problem into multiple MIL-based binary classification
problems. We then employ a discriminative attention layer to aggregate the
outputs from these classifiers into finer-grained classes. Experiments
conducted on three rumor datasets and two stance datasets demonstrate the
effectiveness of our approach, highlighting strong connections between rumor
veracity and expressed stances in responding posts. Our method shows promising
performance in joint rumor and stance detection compared to the
state-of-the-art methods.
comment: Accepted by ACM TIST
☆ BrainWavLM: Fine-tuning Speech Representations with Brain Responses to Language
Speech encoding models use auditory representations to predict how the human
brain responds to spoken language stimuli. Most performant encoding models
linearly map the hidden states of artificial neural networks to brain data, but
this linear restriction may limit their effectiveness. In this work, we use
low-rank adaptation (LoRA) to fine-tune a WavLM-based encoding model end-to-end
on a brain encoding objective, producing a model we name BrainWavLM. We show
that fine-tuning across all of cortex improves average encoding performance
with greater stability than without LoRA. This improvement comes at the expense
of low-level regions like auditory cortex (AC), but selectively fine-tuning on
these areas improves performance in AC, while largely retaining gains made in
the rest of cortex. Fine-tuned models generalized across subjects, indicating
that they learned robust brain-like representations of the speech stimuli.
Finally, by training linear probes, we show that the brain data strengthened
semantic representations in the speech model without any explicit annotations.
Our results demonstrate that brain fine-tuning produces best-in-class speech
encoding models, and that non-linear methods have the potential to bridge the
gap between artificial and biological representations of semantics.
comment: 15 pages, 8 figures
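A hypothetical setup for the LoRA fine-tuning step, using the Hugging Face
peft and transformers libraries; the checkpoint, rank, target modules,
pooling, and loss below are our assumptions, not the authors' released
configuration.

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import WavLMModel

    n_voxels = 4096  # example size of the brain-response vector (assumed)
    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
    model = get_peft_model(wavlm, LoraConfig(r=8, lora_alpha=16,
                                             target_modules=["q_proj", "v_proj"]))
    head = torch.nn.Linear(model.config.hidden_size, n_voxels)  # linear readout

    def encoding_loss(waveform, brain_response):
        """End-to-end brain encoding objective: predict responses from audio."""
        h = model(waveform).last_hidden_state.mean(dim=1)  # pool over time
        return torch.nn.functional.mse_loss(head(h), brain_response)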
☆ EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, Dan Hendrycks
As language models master existing reasoning benchmarks, we need new
challenges to evaluate their cognitive frontiers. Puzzle-solving events are
rich repositories of challenging multimodal problems that test a wide range of
advanced reasoning and knowledge capabilities, making them a unique testbed for
evaluating frontier language models. We introduce EnigmaEval, a dataset of
problems and solutions derived from puzzle competitions and events that probes
models' ability to perform implicit knowledge synthesis and multi-step
deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle
solving challenges models to discover hidden connections between seemingly
unrelated pieces of information to uncover solution paths. The benchmark
comprises 1184 puzzles of varying complexity -- each typically requiring teams
of skilled solvers hours to days to complete -- with unambiguous, verifiable
solutions that enable efficient evaluation. State-of-the-art language models
achieve extremely low accuracy on these puzzles, even lower than other
difficult benchmarks such as Humanity's Last Exam, unveiling models'
shortcomings when challenged with problems requiring unstructured and lateral
reasoning.
♻ ☆ Transformers Learn Low Sensitivity Functions: Investigations and Implications ICLR 2025
Transformers achieve state-of-the-art accuracy and robustness across many
tasks, but an understanding of their inductive biases and how those biases
differ from other neural network architectures remains elusive. In this work,
we identify the sensitivity of the model to token-wise random perturbations in
the input as a unified metric which explains the inductive bias of transformers
across different data modalities and distinguishes them from other
architectures. We show that transformers have lower sensitivity than MLPs,
CNNs, ConvMixers and LSTMs, across both vision and language tasks. We also show
that this low-sensitivity bias has important implications: i) lower sensitivity
correlates with improved robustness; it can also be used as an efficient
intervention to further improve the robustness of transformers; ii) it
corresponds to flatter minima in the loss landscape; and iii) it can serve as a
progress measure for grokking. We support these findings with theoretical
results showing (weak) spectral bias of transformers in the NTK regime, and
improved robustness due to the lower sensitivity. The code is available at
https://github.com/estija/sensitivity.
comment: ICLR 2025. 24 pages, 19 figures, 3 tables
♻ ☆ Hello Again! LLM-powered Personalized Agent for Long-term Dialogue NAACL 2025
Open-domain dialogue systems have seen remarkable advancements with the
development of large language models (LLMs). Nonetheless, most existing
dialogue systems predominantly focus on brief single-session interactions,
neglecting the real-world demands for long-term companionship and personalized
interactions with chatbots. Crucial to addressing this real-world need are
event summary and persona management, which enable reasoning for appropriate
long-term dialogue responses. Recent progress in the human-like cognitive and
reasoning capabilities of LLMs suggests that LLM-based agents could
significantly enhance automated perception, decision-making, and
problem-solving. In response to this potential, we introduce a model-agnostic
framework, the Long-term Dialogue Agent (LD-Agent), which incorporates three
independently tunable modules dedicated to event perception, persona
extraction, and response generation. For the event memory module, long and
short-term memory banks are employed to separately focus on historical and
ongoing sessions, while a topic-based retrieval mechanism is introduced to
enhance the accuracy of memory retrieval. Furthermore, the persona module
conducts dynamic persona modeling for both users and agents. The integration of
retrieved memories and extracted personas is subsequently fed into the
generator to induce appropriate responses. The effectiveness, generality, and
cross-domain capabilities of LD-Agent are empirically demonstrated across
various illustrative benchmarks, models, and tasks. The code is released at
https://github.com/leolee99/LD-Agent.
comment: Accepted to NAACL 2025
♻ ☆ Evaluating Zero-Shot Long-Context LLM Compression
This study evaluates the effectiveness of zero-shot compression techniques on
large language models (LLMs) in long-context settings. We identify a
tendency for computational errors to increase in long contexts when employing
certain compression methods. We propose a hypothesis to explain the varied
behavior of different LLM compression techniques and explore remedies to
mitigate the performance decline observed in some techniques in long-context
settings. This is a
course report for COS 598D Machine Learning and Systems by Prof. Kai Li at
Princeton University. Due to limited computational resources, our experiments
were conducted only on LLaMA-2-7B-32K.
♻ ☆ Salamandra Technical Report
Aitor Gonzalez-Agirre, Marc Pàmies, Joan Llop, Irene Baucells, Severino Da Dalt, Daniel Tamayo, José Javier Saiz, Ferran Espuña, Jaume Prats, Javier Aula-Blasco, Mario Mina, Iñigo Pikabea, Adrián Rubio, Alexander Shvets, Anna Sallés, Iñaki Lacunza, Jorge Palomar, Júlia Falcão, Lucía Tormo, Luis Vasquez-Reina, Montserrat Marimon, Oriol Pareras, Valle Ruiz-Fernández, Marta Villegas
This work introduces Salamandra, a suite of open-source decoder-only large
language models available in three different sizes: 2, 7, and 40 billion
parameters. The models were trained from scratch on highly multilingual data
that comprises text in 35 European languages and code. Our carefully curated
corpus is made exclusively from open-access data compiled from a wide variety
of sources. Along with the base models, supplementary checkpoints that were
fine-tuned on public-domain instruction data are also released for chat
applications. Additionally, we share our preliminary experiments on
multimodality, which serve as proof-of-concept to showcase potential
applications for the Salamandra family. Our extensive evaluations on
multilingual benchmarks reveal that Salamandra has strong capabilities,
achieving competitive performance when compared to similarly sized open-source
models. We provide comprehensive evaluation results both on standard downstream
tasks and on key aspects related to bias and safety. With this technical
report, we intend to promote open science by sharing all the details behind our
design choices, data curation strategy and evaluation methodology. In addition
to that, we deviate from the usual practice by making our training and
evaluation scripts publicly accessible. We release all models under a
permissive Apache 2.0 license in order to foster future research and facilitate
commercial use, thereby contributing to the open-source ecosystem of large
language models.
♻ ☆ Fine-Tuned LLMs are "Time Capsules" for Tracking Societal Bias Through Books NAACL 2025
Books, while often rich in cultural insights, can also mirror societal biases
of their eras - biases that Large Language Models (LLMs) may learn and
perpetuate during training. We introduce a novel method to trace and quantify
these biases using fine-tuned LLMs. We develop BookPAGE, a corpus comprising
593 fictional books across seven decades (1950-2019), to track bias evolution.
By fine-tuning LLMs on books from each decade and using targeted prompts, we
examine shifts in biases related to gender, sexual orientation, race, and
religion. Our findings indicate that LLMs trained on decade-specific books
manifest biases reflective of their times, with both gradual trends and notable
shifts. For example, model responses showed a progressive increase in the
portrayal of women in leadership roles (from 8% to 22%) from the 1950s to
2010s, with a significant uptick in the 1990s (from 4% to 12%), possibly
aligning with third-wave feminism. Same-sex relationship references increased
markedly from the 1980s to 2000s (from 0% to 10%), mirroring growing LGBTQ+
visibility. Concerningly, negative portrayals of Islam rose sharply in the
2000s (26% to 38%), likely reflecting post-9/11 sentiments. Importantly, we
demonstrate that these biases stem mainly from the books' content and not the
models' architecture or initial training. Our study offers a new perspective on
societal bias trends by bridging AI, literary studies, and social science
research.
comment: 9 pages (excluding references), accepted to NAACL 2025
♻ ☆ Measuring Human Contribution in AI-Assisted Content Generation
Yueqi Xie, Tao Qi, Jingwei Yi, Xiyuan Yang, Ryan Whalen, Junming Huang, Qian Ding, Yu Xie, Xing Xie, Fangzhao Wu
With the growing prevalence of generative artificial intelligence (AI), an
increasing amount of content is no longer exclusively generated by humans but
by generative AI models with human guidance. This shift presents notable
challenges for the delineation of originality due to the varying degrees of
human contribution in AI-assisted works. This study raises the research
question of measuring human contribution in AI-assisted content generation and
introduces a framework to address this question that is grounded in information
theory. By calculating mutual information between human input and AI-assisted
output relative to self-information of AI-assisted output, we quantify the
proportional information contribution of humans in content generation. Our
experimental results demonstrate that the proposed measure effectively
discriminates between varying degrees of human contribution across multiple
creative domains. We hope that this work lays a foundation for measuring human
contributions in AI-assisted content generation in the era of generative AI.
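Read literally, the measure is a ratio of mutual information to
self-information, i.e. something like $I(X; Y) / H(Y)$ for human input $X$ and
AI-assisted output $Y$. A pointwise sketch under that reading, assuming both
log-probabilities are available from the generative model:

    def human_contribution(logp_y_given_x, logp_y):
        """Pointwise sketch: mutual information over self-information.
        logp_y_given_x: log p(y | human input) under the model.
        logp_y: log p(y) under the same model without the human input."""
        pmi = logp_y_given_x - logp_y  # pointwise mutual information
        return pmi / (-logp_y)        # normalize by self-information of y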
♻ ☆ Rationalization Models for Text-to-SQL
We introduce a framework for generating Chain-of-Thought (CoT) rationales to
enhance text-to-SQL model fine-tuning. These rationales consist of intermediate
SQL statements and explanations, serving as incremental steps toward
constructing the final SQL query. The process begins with manually annotating a
small set of examples, which are then used to prompt a large language model in
an iterative, dynamic few-shot knowledge distillation procedure from a teacher
model. A rationalization model is subsequently trained on the validated
decomposed queries, enabling extensive synthetic CoT annotations for
text-to-SQL datasets. To evaluate the approach, we fine-tune small language
models with and without these rationales on the BIRD dataset. Results indicate
that step-by-step query generation improves execution accuracy, especially for
moderately and highly complex queries, while also enhancing explainability.
♻ ☆ SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
As a powerful all-weather Earth observation tool, synthetic aperture radar
(SAR) remote sensing enables critical military reconnaissance, maritime
surveillance, and infrastructure monitoring. Although Vision language models
(VLMs) have made remarkable progress in natural language processing and image
understanding, their applications remain limited in professional domains due to
insufficient domain expertise. This paper innovatively proposes the first
large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which
contains approximately 2 million high-quality image-text pairs and
encompasses diverse scenarios with detailed target annotations. This dataset
not only supports key tasks such as visual understanding and object
detection, but also has a unique innovative aspect: this study develops a
visual-language dataset and benchmark for the SAR domain, enabling and
evaluating VLMs' capabilities in SAR image interpretation, which provides a
paradigmatic framework for constructing multimodal datasets across various
remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the
effectiveness of the dataset has been fully verified. The project will be
released at https://github.com/JimmyMa99/SARChat.
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of OM
tools. Our framework is implemented in a proof-of-concept system. Evaluations
of three Ontology Alignment Evaluation Initiative (OAEI) tracks over
state-of-the-art OM systems show that our system can achieve results very close
to the long-standing best performance on simple OM tasks and can significantly
improve the performance on complex and few-shot OM tasks.
comment: 19 pages, 12 figures, 3 tables
♻ ☆ Better Embeddings with Coupled Adam
Despite their remarkable capabilities, LLMs learn word representations that
exhibit the undesirable yet poorly understood feature of anisotropy. In this
paper, we argue that the second moment in Adam is a cause of anisotropic
embeddings, and suggest a modified optimizer called Coupled Adam to mitigate
the problem. Our experiments demonstrate that Coupled Adam significantly
improves the quality of embeddings, while also leading to better upstream and
downstream performance on large enough datasets.
comment: 17 pages, 8 figures; figures corrected
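One plausible reading of the coupling idea, sketched for a single update of a
V x D embedding matrix: the second-moment estimate is averaged over the
vocabulary axis so that all embedding rows share one adaptive scale. Shapes,
hyperparameters, and the exact coupling granularity are our assumptions, not
the authors' implementation.

    import torch

    def coupled_adam_step(grad, m, v_shared, lr=1e-3, b1=0.9, b2=0.999,
                          eps=1e-8, t=1):
        """Adam update for embeddings with a vocabulary-coupled second moment."""
        m.mul_(b1).add_(grad, alpha=1 - b1)           # per-entry first moment
        g2 = grad.pow(2).mean(dim=0, keepdim=True)    # (1, D): couple the rows
        v_shared.mul_(b2).add_(g2, alpha=1 - b2)
        m_hat = m / (1 - b1 ** t)
        v_hat = v_shared / (1 - b2 ** t)
        return lr * m_hat / (v_hat.sqrt() + eps)      # subtract from weights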
♻ ☆ Improving Factual Consistency of News Summarization by Contrastive Preference Optimization
Huawen Feng, Yan Fan, Xiong Liu, Ting-En Lin, Zekun Yao, Yuchuan Wu, Fei Huang, Yongbin Li, Qianli Ma
Despite the recent progress in news summarization made by large language
models (LLMs), they often generate summaries that are factually inconsistent
with original articles, known as "hallucinations" in text generation. Unlike
previous small models (e.g., BART, T5), current LLMs make fewer silly mistakes
but more sophisticated ones, such as imposing cause and effect, adding false
details, overgeneralizing, etc. These hallucinations are challenging to detect
through traditional methods, which poses great challenges for improving the
factual consistency of text summarization. In this paper, we propose
Contrastive Preference Optimization (CPO) to disentangle the LLMs' propensities
to generate faithful and fake content. Furthermore, we adopt a probing-based
specific training method to improve their capacity of distinguishing two types
of propensities. In this way, LLMs can execute the instructions more accurately
and have enhanced perception of hallucinations. Experimental results show that
CPO significantly improves the reliability of summarization based on LLMs.
♻ ☆ The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units NAACL 2025
Large language models (LLMs) exhibit remarkable capabilities on not just
language tasks, but also various tasks that are not linguistic in nature, such
as logical reasoning and social inference. In the human brain, neuroscience has
identified a core language system that selectively and causally supports
language processing. We here ask whether similar specialization for language
emerges in LLMs. We identify language-selective units within 18 popular LLMs,
using the same localization approach that is used in neuroscience. We then
establish the causal role of these units by demonstrating that ablating LLM
language-selective units -- but not random units -- leads to drastic deficits
in language tasks. Correspondingly, language-selective LLM units are more
aligned to brain recordings from the human language system than random units.
Finally, we investigate whether our localization method extends to other
cognitive domains: while we find specialized networks in some LLMs for
reasoning and social capabilities, there are substantial differences among
models. These findings provide functional and causal evidence for
specialization in large language models, and highlight parallels with the
functional organization in the brain.
comment: NAACL 2025
♻ ☆ WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng, Pu Zhao, Qingfeng Sun, Can Xu, Fangkai Yang, Lu Wang, Qianli Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Despite recent progress achieved by code large language models (LLMs), their
remarkable abilities are largely dependent on fine-tuning on the high-quality
data, posing challenges for data collection and annotation. To address this,
current methods often design various data flywheels to collect complex code
instructions, enabling models to handle more intricate tasks. However, these
approaches typically rely on off-the-shelf datasets and data augmentation from
a limited set of proprietary LLMs (e.g., Claude, GPT4, and so on), which
restricts the diversity of the constructed data and makes it prone to systemic
biases. In this paper, we propose WarriorCoder, a novel paradigm that learns from
expert battles to address these limitations. Specifically, we create an arena
where leading expert code LLMs challenge each other, with evaluations conducted
by impartial judges. This competitive framework generates novel training data
from scratch, leveraging the strengths of all participants. Experimental
results show that WarriorCoder achieves state-of-the-art performance compared
to previous models of the same size, even without relying on proprietary LLMs.
♻ ☆ Generative Prompt Internalization NAACL 2025
Prompts used in recent large language model based applications are often
fixed and lengthy, leading to significant computational overhead. To address
this challenge, we propose Generative Prompt Internalization (GenPI), a
lightweight method that employs a joint training approach. GenPI not only
replicates the behavior of models with prompt inputs but also generates the
content of the prompt along with reasons for why the model's behavior should
change accordingly. We demonstrate that our approach effectively internalizes
complex prompts across various agent-based application scenarios. For effective
training without interactions with the dedicated environments, we introduce a
data synthesis technique that autonomously collects conversational datasets by
swapping the roles of the agent and environment. This method is especially
useful in scenarios where only a predefined prompt is available without a
corresponding training dataset. By internalizing complex prompts, Generative
Prompt Internalization enables high performance and efficient inference without
the need for explicit prompts.
comment: NAACL 2025 (Main Conference)
♻ ☆ Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation
Mahnaz Koupaee, Jake W. Vincent, Saab Mansour, Igor Shalyminov, Han He, Hwanjun Song, Raphael Shu, Jianfeng He, Yi Nian, Amy Wing-mei Wong, Kyu J. Han, Hang Su
Faithfulness evaluators based on large language models (LLMs) are often
fooled by the fluency of the text and struggle with identifying errors in the
summaries. We propose an approach to summary faithfulness evaluation in which
multiple LLM-based agents are assigned initial stances (regardless of what
their belief might be) and forced to come up with a reason to justify the
imposed belief, thus engaging in a multi-round debate to reach an agreement.
The uniformly distributed initial assignments result in a greater diversity of
stances, leading to more meaningful debates and ultimately more errors being
identified. Furthermore, by analyzing the recent faithfulness evaluation
datasets, we observe that a summary is not always cleanly either faithful or
unfaithful to the source document. We therefore introduce a new
dimension, ambiguity, and a detailed taxonomy to identify such special cases.
Experiments demonstrate that our approach can help identify ambiguities and
achieves even stronger performance on non-ambiguous summaries.
♻ ☆ An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases
Large language models (LLMs) can exhibit bias in a variety of ways. Such
biases can create or exacerbate unfair outcomes for certain groups within a
protected attribute, including, but not limited to, sex, race, sexual
orientation, or age. In this paper, we propose a decision framework that allows
practitioners to determine which bias and fairness metrics to use for a
specific LLM use case. To establish the framework, we define bias and fairness
risks for LLMs, map those risks to a taxonomy of LLM use cases, and then define
various metrics to assess each type of risk. Instead of focusing solely on the
model itself, we account for both prompt-specific and model-specific risk by
defining evaluations at the level of an LLM use case, characterized by a model
and a population of prompts. Furthermore, because all of the evaluation metrics
are calculated solely using the LLM output, our proposed framework is highly
practical and easily actionable for practitioners. For streamlined
implementation, all evaluation metrics included in the framework are offered in
this paper's companion Python toolkit, LangFair. Finally, our experiments
demonstrate substantial variation in bias and fairness across use cases,
underscoring the importance of use-case-level assessments.
comment: LangFair repository: https://github.com/cvs-health/langfair
♻ ☆ On-Device Emoji Classifier Trained with GPT-based Data Augmentation for a Mobile Keyboard
Emojis improve communication quality among smartphone users who use mobile
keyboards to exchange text. To predict emojis for users based on input text, we
should consider the on-device low memory and time constraints, ensure that the
on-device emoji classifier covers a wide range of emoji classes even though the
emoji dataset is typically imbalanced, and adapt the emoji classifier output to
user favorites. This paper proposes an on-device emoji classifier based on
MobileBert with reasonable memory and latency requirements for SwiftKey. To
account for the data imbalance, we utilize the widely used GPT to generate one
or more tags for each emoji class. For each emoji and corresponding tags, we
merge the original set with GPT-generated sentences and label them with this
emoji without human intervention to alleviate the data imbalance. At inference
time, we interpolate the classifier output with the user's emoji history for
better emoji classification. Results show that the proposed on-device emoji
classifier deployed for SwiftKey improves emoji prediction accuracy,
particularly on rare emojis, as well as emoji engagement.
comment: 8 pages
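The inference-time personalization step amounts to interpolating two
distributions; a minimal sketch, where the blending weight `alpha` is an
assumed hyperparameter rather than a value from the paper:

    import numpy as np

    def personalize_emoji_probs(model_probs, user_counts, alpha=0.8):
        """Blend classifier output with the user's own emoji frequencies."""
        user_prior = user_counts / max(user_counts.sum(), 1)  # counts -> dist.
        return alpha * model_probs + (1 - alpha) * user_prior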
♻ ☆ DeepThink: Aligning Language Models with Domain-Specific User Intents
Supervised fine-tuning with synthesized instructions has been a common
practice for adapting LLMs to domain-specific QA tasks. However, the
synthesized instructions deviate from real user questions and expected answers.
This study proposes a novel framework called DeepThink to generate high-quality
instructions. DeepThink first generates a few seed questions to mimic actual
user questions, simulates conversations to uncover hidden user needs, and
refines the answers using conversational contexts and retrieved documents for
more comprehensive answers. Experiments demonstrate that DeepThink achieves an
average performance improvement of 7.92% compared to a GPT-4-turbo+RAG-based
assistant on the real user test set in the advertising domain across dimensions
such as relevance, completeness, clarity, accuracy, and actionability.
♻ ☆ Enhancing Large Language Model Performance with Gradient-Based Parameter Selection AAAI 2025
Large language models (LLMs) have revolutionized lots of fields of research.
Although it is well-known that fine-tuning is essential for enhancing the
capabilities of LLMs, existing research suggests that there is potential
redundancy in the fine-tuning process and therefore proposes to update only a
subset of parameters. However, these methods fail to leverage task-specific
information to identify important parameters during training. Based on the
insight that gradients inherently contain information on task-specific data, we
propose Gradient-Mask Tuning (GMT), a method that selectively updates
parameters during training based on their gradient information. Specifically,
we compute the absolute values of the gradients and apply masking to those with
relatively smaller magnitudes. Our empirical results across various tasks
demonstrate that GMT not only outperforms traditional fine-tuning methods but
also elevates the upper limits of LLM performance. Further analysis indicates
that GMT exhibits insensitivity to mask ratio and possesses computational
efficiency comparable to vanilla SFT.
comment: Accepted by AAAI 2025
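The core of GMT, as described, is a per-step mask on small-magnitude
gradients. A minimal PyTorch sketch follows; applying the threshold per
parameter tensor with a fixed mask ratio is an assumption about granularity,
not the paper's exact recipe.

    import torch

    def gmt_step(model: torch.nn.Module,
                 optimizer: torch.optim.Optimizer,
                 mask_ratio: float = 0.5) -> None:
        """Zero the smallest-magnitude gradients in each tensor, then step."""
        for param in model.parameters():
            if param.grad is None:
                continue
            flat = param.grad.abs().flatten().float()
            k = max(1, int(mask_ratio * flat.numel()))
            threshold = flat.kthvalue(k).values
            # Keep only gradients at or above the magnitude threshold.
            param.grad.mul_((param.grad.abs() >= threshold).to(param.grad.dtype))
        optimizer.step()
        optimizer.zero_grad()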
♻ ☆ ACEBench: Who Wins the Match Point in Tool Usage?
Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
Large Language Models (LLMs) have demonstrated significant potential in
decision-making and reasoning, particularly when integrated with various tools
to effectively solve complex problems. However, existing benchmarks for
evaluating LLMs' tool usage face several limitations: (1) limited evaluation
scenarios, often lacking assessments in real multi-turn dialogue contexts; (2)
narrow evaluation dimensions, with insufficient detailed assessments of how
LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation,
which introduces significant overhead. To address these challenges, we
introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs.
ACEBench categorizes data into three primary types based on evaluation
methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic
scenarios; "Special" evaluates tool usage in situations with ambiguous or
incomplete instructions; "Agent" evaluates tool usage through multi-agent
interactions to simulate real-world, multi-turn dialogues. We conducted
extensive experiments using ACEBench, analyzing various LLMs in-depth and
providing a more granular examination of error causes across different data
types.
♻ ☆ ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation
Automated radiology report generation (R2Gen) has advanced significantly,
introducing challenges in accurate evaluation due to its complexity.
Traditional metrics often fall short by relying on rigid word-matching or
focusing only on pathological entities, leading to inconsistencies with human
assessments. To bridge this gap, we introduce ReFINE, an automatic evaluation
metric designed specifically for R2Gen. Our metric utilizes a reward model,
guided by our margin-based reward enforcement loss, along with a tailored
training data design that enables customization of evaluation criteria to suit
user-defined needs. It not only scores reports according to user-specified
criteria but also provides detailed sub-scores, enhancing interpretability and
allowing users to reweight the criteria across different aspects of a report.
Leveraging GPT-4, we designed an easy-to-use data generation pipeline, enabling
us to produce extensive training data based on two distinct scoring systems,
each containing reports of varying quality along with corresponding scores.
These GPT-generated reports are then paired as accepted and rejected samples
through our pairing rule to train an LLM as our fine-grained reward model,
which assigns higher rewards to reports of higher quality. Our reward-control
loss enables this model to simultaneously output one individual reward per
evaluation criterion, whose sum yields the final ReFINE score. Our experiments
demonstrate ReFINE's
heightened correlation with human judgments and superior performance in model
selection compared to traditional metrics. Notably, our model provides both an
overall score and individual scores for each evaluation item, enhancing
interpretability. We also demonstrate its flexible training across various
evaluation systems.
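The abstract does not spell out the margin-based loss; one standard pairwise
formulation consistent with its description (accepted report scored above the
rejected one by a margin, per-criterion rewards summed into the final score)
is sketched below as an assumption.

    import torch
    import torch.nn.functional as F

    def margin_reward_loss(r_accepted: torch.Tensor,
                           r_rejected: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
        """Hinge loss pushing each accepted report's reward above its
        paired rejected report's reward by at least `margin`."""
        return F.relu(margin - (r_accepted - r_rejected)).mean()

    # The reward head emits one scalar per criterion; their sum is the
    # overall score, keeping per-criterion interpretability.
    sub_rewards = torch.tensor([0.7, 0.9, 0.4])  # e.g. accuracy, completeness, clarity
    loss = margin_reward_loss(sub_rewards.sum(), torch.tensor(1.1))
    print(float(sub_rewards.sum()), float(loss))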
♻ ☆ On the Creativity of Large Language Models
Large Language Models (LLMs) are revolutionizing several areas of Artificial
Intelligence. One of the most remarkable applications is creative writing,
e.g., poetry or storytelling: the generated outputs are often of astonishing
quality. However, a natural question arises: can LLMs really be considered
creative? In this article, we first analyze the development of LLMs under the
lens of creativity theories, investigating the key open questions and
challenges. In particular, we focus our discussion on the dimensions of value,
novelty, and surprise as proposed by Margaret Boden in her work. Then, we
consider different classic perspectives, namely product, process, press, and
person. We discuss a set of "easy" and "hard" problems in machine
creativity, presenting them in relation to LLMs. Finally, we examine the
societal impact of these technologies with a particular focus on the creative
industries, analyzing the opportunities offered, the challenges arising from
them, and the potential associated risks, from both legal and ethical points of
view.
comment: Published in AI & SOCIETY at
https://link.springer.com/article/10.1007/s00146-024-02127-3
♻ ☆ AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
Despite the outstanding capabilities of large language models (LLMs),
knowledge-intensive reasoning still remains a challenging task due to LLMs'
limitations in compositional reasoning and the hallucination problem. A
prevalent solution is to employ chain-of-thought (CoT) with retrieval-augmented
generation (RAG), which first formulates a reasoning plan by decomposing
complex questions into simpler sub-questions, and then applies iterative RAG at
each sub-question. However, prior works exhibit two crucial problems:
inadequate reasoning planning and poor incorporation of heterogeneous
knowledge. In this paper, we introduce AtomR, a framework for LLMs to conduct
accurate heterogeneous knowledge reasoning at the atomic level. Inspired by how
knowledge graph query languages model compositional reasoning through combining
predefined operations, we propose three atomic knowledge operators, a unified
set of operators for LLMs to retrieve and manipulate knowledge from
heterogeneous sources. First, in the reasoning planning stage, AtomR decomposes
a complex question into a reasoning tree where each leaf node corresponds to an
atomic knowledge operator, achieving question decomposition that is highly
fine-grained and orthogonal. Subsequently, in the reasoning execution stage,
AtomR executes each atomic knowledge operator, which flexibly selects,
retrieves, and operates atomic level knowledge from heterogeneous sources. We
also introduce BlendQA, a challenging benchmark specially tailored for
heterogeneous knowledge reasoning. Experiments on three single-source and two
multi-source datasets show that AtomR outperforms state-of-the-art baselines by
a large margin, with F1 score improvements of 9.4% on 2WikiMultihop and 9.5% on
BlendQA. We release our code and datasets.
♻ ☆ Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory NAACL 25
Privacy research has attracted wide attention as individuals worry that their
private data can be easily leaked during interactions with smart devices,
social platforms, and AI applications. Computer science researchers, on the
other hand, commonly study privacy issues through privacy attacks and defenses
on segmented fields. Privacy research is conducted on various sub-fields,
including Computer Vision (CV), Natural Language Processing (NLP), and Computer
Networks. Within each field, privacy has its own formulation. Though pioneering
works on attacks and defenses reveal sensitive privacy issues, they are
narrowly scoped and cannot fully cover people's actual privacy concerns.
Consequently, general, human-centric privacy research remains largely
unexplored. In this paper, we formulate the privacy issue as a reasoning
problem rather than simple pattern matching. We ground our work in Contextual
Integrity (CI) theory, which posits that people's perceptions of
privacy are highly correlated with the corresponding social context. Based on
such an assumption, we develop the first comprehensive checklist that covers
social identities, private attributes, and existing privacy regulations. Unlike
prior works on CI that either cover limited expert annotated norms or model
incomplete social context, our proposed privacy checklist uses the entire
Health Insurance Portability and Accountability Act of 1996 (HIPAA) as an
example to show that large language models (LLMs) can be leveraged to
completely cover HIPAA's regulations. Additionally, our checklist gathers
expert
annotations across multiple ontologies to determine private information
including but not limited to personally identifiable information (PII). We use
our preliminary results on the HIPAA to shed light on future context-centric
privacy research to cover more privacy regulations, social norms and standards.
comment: To appear at NAACL 25
♻ ☆ Language Models as Continuous Self-Evolving Data Engineers
Large Language Models (LLMs) have demonstrated remarkable capabilities on
various tasks, but their further evolution is limited by the lack of
high-quality training data. In addition, traditional training approaches rely
too heavily on expert-labeled data, setting a ceiling on LLM performance.
To address this issue, we propose a novel paradigm named LANCE (LANguage models
as Continuous self-Evolving data engineers) that enables LLMs to train
themselves by autonomously generating, cleaning, reviewing, and annotating data
with preference information. Our approach demonstrates that LLMs can serve as
continuous self-evolving data engineers, significantly reducing the time and
cost of the post-training data construction. Through iterative fine-tuning on
Qwen2 series models, we validate the effectiveness of LANCE across various
tasks, showing that it can maintain high-quality data generation and
continuously improve model performance. Across multiple benchmark dimensions,
LANCE results in an average score enhancement of 3.64 for Qwen2-7B and 1.75 for
Qwen2-7B-Instruct. This training paradigm with autonomous data construction not
only reduces the reliance on human experts or external models but also ensures
that the data aligns with human preferences, paving the way for the development
of future superintelligent systems that can exceed human capabilities. Code is
available at: https://github.com/Control-derek/LANCE.
♻ ☆ Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Large Language Models (LLMs) have revolutionized artificial intelligence,
demonstrating remarkable computational power and linguistic capabilities.
However, these models are inherently prone to various biases stemming from
their training data. These include selection, linguistic, and confirmation
biases, along with common stereotypes related to gender, ethnicity, sexual
orientation, religion, socioeconomic status, disability, and age. This study
explores the presence of these biases within the responses given by the most
recent LLMs, analyzing the impact on their fairness and reliability. We also
investigate how known prompt engineering techniques can be exploited to
effectively reveal hidden biases of LLMs, testing their adversarial robustness
against jailbreak prompts specially crafted for bias elicitation. Extensive
experiments are conducted using the most widespread LLMs at different scales,
confirming that LLMs can still be manipulated to produce biased or
inappropriate responses, despite their advanced capabilities and sophisticated
alignment processes. Our findings underscore the importance of enhancing
mitigation techniques to address these safety issues, toward a more sustainable
and inclusive artificial intelligence.
♻ ☆ Exploring Large Language Models for Knowledge Graph Completion ICASSP 2025
Knowledge graphs play a vital role in numerous artificial intelligence tasks,
yet they frequently face the issue of incompleteness. In this study, we explore
utilizing Large Language Models (LLMs) for knowledge graph completion. We
consider triples in knowledge graphs as text sequences and introduce an
innovative framework called Knowledge Graph LLM (KG-LLM) to model these
triples. Our technique employs entity and relation descriptions of a triple as
prompts and utilizes the response for predictions. Experiments on various
benchmark knowledge graphs demonstrate that our method attains state-of-the-art
performance in tasks such as triple classification and relation prediction. We
also find that fine-tuning relatively smaller models (e.g., LLaMA-7B,
ChatGLM-6B) outperforms recent ChatGPT and GPT-4.
comment: Accepted by the 2025 IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2025)
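The triple-as-text recipe can be made concrete with a small prompt builder;
the template wording below is hypothetical, but it follows the paper's
approach of serializing a triple with entity and relation descriptions and
reading off the LLM's answer.

    def triple_classification_prompt(head: str, relation: str, tail: str,
                                     descriptions: dict) -> str:
        """Render a KG triple plus descriptions as a yes/no prompt."""
        return (
            f"Head entity: {head}. Description: {descriptions.get(head, 'N/A')}\n"
            f"Relation: {relation}. Description: {descriptions.get(relation, 'N/A')}\n"
            f"Tail entity: {tail}. Description: {descriptions.get(tail, 'N/A')}\n"
            "Is this triple true? Answer yes or no."
        )

    print(triple_classification_prompt(
        "Marie Curie", "field_of_work", "Physics",
        {"Marie Curie": "Polish-French physicist and chemist.",
         "field_of_work": "the academic discipline a person works in.",
         "Physics": "the natural science of matter and energy."}))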
♻ ☆ An Evolved Universal Transformer Memory ICLR 2025
Prior methods propose to offset the escalating costs of modern foundation
models by dropping specific parts of their contexts with hand-designed rules,
while attempting to preserve their original performance. We overcome this
trade-off with Neural Attention Memory Models (NAMMs), introducing a learned
network for memory management that improves both the performance and efficiency
of transformers. We evolve NAMMs atop pre-trained transformers to provide
different latent contexts focusing on the most relevant information for
individual layers and attention heads. NAMMs are universally applicable to any
model using self-attention as they condition exclusively on the values in the
produced attention matrices. Learning NAMMs on a small set of problems, we
achieve substantial performance improvements across multiple long-context
benchmarks while cutting the model's input contexts down to a fraction of
their original size. We show that the generality of our conditioning enables
zero-shot
transfer of NAMMs trained only on language to entirely new transformer
architectures even across input modalities, with their benefits carrying over
to vision and reinforcement learning.
comment: Published at ICLR 2025. Source code available at
https://github.com/SakanaAI/evo-memory
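NAMMs are evolved scoring networks over attention statistics, which is hard to
condense; as a deliberately crude stand-in, the sketch below scores each
cached token by the raw attention mass it receives and evicts the
lowest-scoring entries. It conveys only the memory-management interface; the
learned component is what NAMMs add on top of such heuristics.

    import torch

    def evict_low_attention_tokens(attn: torch.Tensor,
                                   keep_ratio: float = 0.5) -> torch.Tensor:
        """Keep the KV-cache entries that receive the most attention.

        attn: (heads, queries, keys) attention weights for one layer.
        Returns the sorted indices of key tokens to retain.
        """
        scores = attn.sum(dim=(0, 1))                # total mass per key token
        k = max(1, int(keep_ratio * scores.numel()))
        return torch.topk(scores, k).indices.sort().values

    attn = torch.rand(8, 16, 16).softmax(dim=-1)     # toy attention tensor
    print(evict_low_attention_tokens(attn, keep_ratio=0.25))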
♻ ☆ Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study
Lena Schmidt, Kaitlyn Hair, Sergio Graziosi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas
This paper describes a rapid feasibility study of using GPT-4, a large
language model (LLM), to (semi)automate data extraction in systematic reviews.
Despite the recent surge of interest in LLMs there is still a lack of
understanding of how to design LLM-based automation tools and how to robustly
evaluate their performance. During the 2023 Evidence Synthesis Hackathon we
conducted two feasibility studies. Firstly, to automatically extract study
characteristics from human clinical, animal, and social science domain studies.
We used two studies from each category for prompt-development; and ten for
evaluation. Secondly, we used the LLM to predict Participants, Interventions,
Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP
dataset. Overall, results indicated an accuracy of around 80%, with some
variability between domains (82% for human clinical, 80% for animal, and 72%
for studies of human social sciences). Causal inference methods and study
design were the data extraction items with the most errors. In the PICO study,
participants and intervention/control showed high accuracy (>80%), while
outcomes were more challenging. Evaluation was done manually; scoring methods
such as BLEU and ROUGE showed limited value. We observed variability in the
LLM's predictions and changes in response quality. This paper presents a
template for
future evaluations of LLMs in the context of data extraction for systematic
review automation. Our results show that there might be value in using LLMs,
for example as second or third reviewers. However, caution is advised when
integrating models such as GPT-4 into tools. Further research on stability and
reliability in practical settings is warranted for each type of data that is
processed by the LLM.
comment: Conference proceedings, peer-reviewed and presented at the 3rd
Workshop on Augmented Intelligence for Technology-Assisted Reviews Systems,
Glasgow, 2024
♻ ☆ What Large Language Models Know and What People Think They Know
Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas Mayer, Padhraic Smyth
As artificial intelligence (AI) systems, particularly large language models
(LLMs), become increasingly integrated into decision-making processes, the
ability to trust their outputs is crucial. To earn human trust, LLMs must be
well calibrated such that they can accurately assess and communicate the
likelihood of their predictions being correct. Whereas recent work has focused
on LLMs' internal confidence, less is understood about how effectively they
convey uncertainty to users. Here we explore the calibration gap, which refers
to the difference between human confidence in LLM-generated answers and the
models' actual confidence, and the discrimination gap, which reflects how well
humans and models can distinguish between correct and incorrect answers. Our
experiments with multiple-choice and short-answer questions reveal that users
tend to overestimate the accuracy of LLM responses when provided with default
explanations. Moreover, longer explanations increased user confidence, even
when the extra length did not improve answer accuracy. By adjusting LLM
explanations to better reflect the models' internal confidence, both the
calibration gap and the discrimination gap narrowed, significantly improving
user perception of LLM accuracy. These findings underscore the importance of
accurate uncertainty communication and highlight the effect of explanation
length in influencing user trust in AI-assisted decision-making environments.
Code and Data can be found at https://osf.io/y7pr6/ . Journal publication can
be found on Nature Machine Intelligence at
https://www.nature.com/articles/s42256-024-00976-7 .
comment: 27 pages, 10 figures For the journal publication on Nature Machine
Intelligence see https://www.nature.com/articles/s42256-024-00976-7 For the
data and code see https://osf.io/y7pr6/
♻ ☆ Hallucination is Inevitable: An Innate Limitation of Large Language Models
Hallucination has been widely recognized to be a significant drawback for
large language models (LLMs). There have been many works that attempt to reduce
the extent of hallucination. These efforts have mostly been empirical so far,
which cannot answer the fundamental question of whether it can be completely
eliminated. In this paper, we formalize the problem and show that it is
impossible to eliminate hallucination in LLMs. Specifically, we define a formal
world where hallucination is defined as inconsistencies between a computable
LLM and a computable ground truth function. By employing results from learning
theory, we show that LLMs cannot learn all the computable functions and will
therefore inevitably hallucinate if used as general problem solvers. Since the
formal world is a part of the real world which is much more complicated,
hallucinations are also inevitable for real world LLMs. Furthermore, for real
world LLMs constrained by provable time complexity, we describe the
hallucination-prone tasks and empirically validate our claims. Finally, using
the formal world framework, we discuss the possible mechanisms and efficacies
of existing hallucination mitigators as well as the practical implications on
the safe deployment of LLMs.
♻ ☆ CharacterGPT: A Persona Reconstruction Framework for Role-Playing Agents NAACL 2025
With the recent introduction of Assistants API, it is expected that
document-based language models will be actively used in various domains,
especially role-playing. However, a key challenge lies in utilizing the
protagonist's persona: the Assistants API's retrieval often fails because the
extracted information differs on each call and frequently omits important
details such as the protagonist's backstory or relationships.
It is hard to maintain a consistent persona simply by using the persona
document as input to the Assistants API. To address the challenge of achieving
stable persona consistency, we propose CharacterGPT, a novel persona
reconstruction framework to alleviate the shortcomings of the Assistants API.
Our method involves Character Persona Training (CPT), an effective persona
rebuilding process that updates each character's persona by extracting the
character's traits from a given summary of the novel as the story progresses.
In our experiments, we ask each character to take
the Big Five Inventory personality test in various settings and analyze the
results. To assess whether it can think outside the box, we let each character
generate short novels. Extensive experiments and human evaluation demonstrate
that CharacterGPT presents new possibilities for role-playing agent research.
Code and results are available at: https://github.com/Jeiyoon/charactergpt
comment: NAACL 2025 Industry Track (Oral)
♻ ☆ Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning
Recent advancements in large language models (LLMs) have demonstrated
remarkable reasoning capabilities. However, single-shot inference often yields
unreliable results for complex reasoning tasks, leading researchers to explore
multiple reasoning paths through methods such as perplexity and
self-consistency. In this paper, we present the first theoretical error
decomposition analysis of these techniques, breaking down their error into
estimation error and model error. Our analysis reveals a fundamental trade-off:
perplexity methods suffer from substantial model error due to the absence of a
proper consistency function, while self-consistency exhibits high estimation
error due to a slow error convergence rate. To overcome these limitations, we
propose Reasoning-Pruning Perplexity Consistency (RPC). This approach combines
Perplexity Consistency, which seamlessly integrates LLM perplexity with
self-consistency, and Reasoning Pruning, which eliminates low-probability
reasoning paths to effectively prevent the degeneration of estimation error
reduction. Theoretical analysis demonstrates that RPC not only accelerates the
convergence rate of estimation error to an exponential level but also holds
strong potential for further reducing model error. Extensive empirical
evaluations on seven benchmark datasets confirm that RPC can significantly
improve reasoning performance, sample efficiency, and confidence reliability.
comment: Preliminary work
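The flavor of combining perplexity with self-consistency can be shown in a few
lines: weight each sampled answer by its sequence probability and drop paths
far below the most probable one. The relative-threshold pruning rule here is
an illustrative stand-in for the paper's Reasoning Pruning.

    import math
    from collections import defaultdict

    def rpc_vote(samples: list[tuple[str, float]],
                 prune_frac: float = 0.1) -> str:
        """samples: (answer, sequence log-probability) pairs from the LLM."""
        probs = [math.exp(lp) for _, lp in samples]
        cutoff = prune_frac * max(probs)             # prune unlikely paths
        scores = defaultdict(float)
        for (answer, _), p in zip(samples, probs):
            if p >= cutoff:
                scores[answer] += p                  # probability-weighted vote
        return max(scores, key=scores.get)

    print(rpc_vote([("42", -0.2), ("42", -0.5), ("41", -3.0)]))  # -> 42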
♻ ☆ Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Steel-LLM is a Chinese-centric language model developed from scratch with the
goal of creating a high-quality, open-source model despite limited
computational resources. Launched in March 2024, the project aimed to train a
1-billion-parameter model on a large-scale dataset, prioritizing transparency
and the sharing of practical insights to assist others in the community. The
training process primarily focused on Chinese data, with a small proportion of
English data included, addressing gaps in existing open-source LLMs by
providing a more detailed and practical account of the model-building journey.
Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL
and CMMLU, outperforming early models from larger institutions. This paper
provides a comprehensive summary of the project's key contributions, including
data collection, model design, training methodologies, and the challenges
encountered along the way, offering a valuable resource for researchers and
practitioners looking to develop their own LLMs. The model checkpoints and
training script are available at https://github.com/zhanshijinwat/Steel-LLM.
♻ ☆ My Words Imply Your Opinion: Reader Agent-Based Propagation Enhancement for Personalized Implicit Emotion Analysis
The subtlety of emotional expressions makes implicit emotion analysis (IEA)
particularly sensitive to user-specific characteristics. Current studies
personalize emotion analysis by focusing on the author but neglect the impact
of the intended reader on implicit emotional feedback. In this paper, we
introduce Personalized IEA (PIEA) and present the RAPPIE model, which addresses
subjective variability by incorporating reader feedback. In particular, (1) we
create reader agents based on large language models to simulate reader
feedback, overcoming the "spiral of silence" effect and the incompleteness of
real reader reactions. (2) We develop a role-aware multi-view
graph learning to model the emotion interactive propagation process in
scenarios with sparse reader information. (3) We construct two new PIEA
datasets covering English and Chinese social media with detailed user metadata,
addressing the text-centric limitation of existing datasets. Extensive
experiments show that RAPPIE significantly outperforms state-of-the-art
baselines, demonstrating the value of incorporating reader feedback in PIEA.
♻ ☆ Loss Landscape Degeneracy Drives Stagewise Development in Transformers
Deep learning involves navigating a high-dimensional loss landscape over the
neural network parameter space. Over the course of training, complex
computational structures form and re-form inside the neural network, leading to
shifts in input/output behavior. It is a priority for the science of deep
learning to uncover principles governing the development of neural network
structure and behavior. Drawing on the framework of singular learning theory,
we propose that model development is deeply linked to degeneracy in the local
geometry of the loss landscape. We investigate this link by monitoring loss
landscape degeneracy throughout training, as quantified by the local learning
coefficient, for a transformer language model and an in-context linear
regression transformer. We show that training can be divided into distinct
periods of change in loss landscape degeneracy, and that these changes in
degeneracy coincide with significant changes in the internal computational
structure and the input/output behavior of the transformers. This finding
underscores the potential of a degeneracy-based perspective for understanding
modern deep learning.
comment: Material on essential dynamics from v1 of this preprint has been
removed from v2 and developed in arXiv:2501.17745
♻ ☆ LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM
Zhi Zhou, Kun-Yang Yu, Shi-Yu Tian, Xiao-Wen Yang, Jiang-Xin Shi, Pengxiao Song, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Large language models (LLMs), both proprietary and open-source, have
demonstrated remarkable capabilities across various natural language processing
tasks. However, they face significant limitations in legal reasoning tasks.
Proprietary models introduce data privacy risks and high inference costs, while
open-source models underperform due to insufficient legal domain training data.
To address these limitations, we study data generation for legal reasoning to
improve the legal reasoning performance of open-source LLMs with the help of
proprietary LLMs. This is challenging due to the lack of legal knowledge in
proprietary LLMs and the difficulty in verifying the generated data. We propose
KgDG, a knowledge-guided data generation framework for legal reasoning. Our
framework enables leveraging legal knowledge to enhance generation diversity
and introduces a refinement and verification process to ensure the quality of
generated data. Moreover, we expand the generated dataset to further enhance
the LLM reasoning capabilities. Using KgDG, we create a synthetic legal
reasoning dataset containing 50K high-quality examples. Our trained model
LawGPT outperforms existing legal-specific LLMs and achieves performance
comparable to proprietary LLMs, demonstrating the effectiveness of KgDG and
LawGPT. Our code and resources are publicly available at
https://github.com/LAMDASZ-ML/Knowledge-Guide-Data-Generation .
comment: Preprint
♻ ☆ How Does Knowledge Selection Help Retrieval Augmented Generation?
Retrieval-augmented generation (RAG) is a powerful method for enhancing
natural language generation by integrating external knowledge into a model's
output. While prior work has demonstrated the importance of improving knowledge
retrieval for boosting generation quality, the role of knowledge selection
remains less clear. In this paper, we perform a comprehensive analysis of how
knowledge selection influences downstream generation performance in RAG
systems. By simulating different retrieval and selection conditions through a
controlled mixture of gold and distractor knowledge, we assess the impact of
these factors on generation outcomes. Our findings indicate that the downstream
generator model's capability, as well as the complexity of the task and
dataset, significantly influence the impact of knowledge selection on the
overall RAG system performance. In typical scenarios, improving the knowledge
recall score is key to enhancing generation outcomes, with the knowledge
selector providing a limited additional benefit when a strong generator model
is used on clear, well-defined tasks. For weaker generator models or more
ambiguous tasks and datasets, the knowledge F1 score becomes a critical factor,
and the knowledge selector plays a more prominent role in improving overall
performance.
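For concreteness, the two knowledge-selection quantities discussed above can
be computed by treating selection as set retrieval against gold knowledge;
this simplified sketch assumes exact matching of knowledge pieces.

    def knowledge_recall_and_f1(selected: set, gold: set) -> tuple[float, float]:
        """Recall and F1 of selected knowledge pieces against gold ones."""
        tp = len(selected & gold)
        recall = tp / len(gold) if gold else 0.0
        precision = tp / len(selected) if selected else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        return recall, f1

    print(knowledge_recall_and_f1({"k1", "k2", "d1"}, {"k1", "k2", "k3"}))
    # -> (0.666..., 0.666...)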
♻ ☆ Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, Kun Kuang
Achieving balanced alignment of large language models (LLMs) in terms of
Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a
cornerstone of responsible AI, with existing methods like data mixture
strategies facing limitations including reliance on expert knowledge and
conflicting optimization signals. While model merging offers a promising
alternative by integrating specialized models, its potential for 3H
optimization remains underexplored. This paper establishes the first
comprehensive benchmark for model merging in 3H-aligned LLMs, systematically
evaluating 15 methods (12 training-free merging and 3 data mixture techniques)
across 10 datasets associated with 5 annotation dimensions, 2 LLM families, and
2 training paradigms. Our analysis reveals three pivotal insights: (i)
previously overlooked collaborative/conflicting relationships among 3H
dimensions, (ii) the consistent superiority of model merging over data mixture
approaches in balancing alignment trade-offs, and (iii) the critical role of
parameter-level conflict resolution through redundant component pruning and
outlier mitigation. Building on these findings, we propose R-TSVM, a
Reweighting-enhanced Task Singular Vector Merging method that incorporates
outlier-aware parameter weighting and sparsity-adaptive rank selection
strategies adapted to the heavy-tailed parameter distributions and sparsity of
LLMs, further improving LLM alignment across multiple evaluations. We release
our trained models for further exploration.
♻ ☆ LegalViz: Legal Text Visualization by Text To Diagram Generation NAACL2025
Legal documents including judgments and court orders require highly
sophisticated legal knowledge for understanding. To disclose expert knowledge
for non-experts, we explore the problem of visualizing legal texts with
easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
languages and 7,010 cases of legal document and visualization pairs, using the
DOT graph description language of Graphviz. LegalViz provides a simple diagram
from a complicated legal corpus identifying legal entities, transactions, legal
sources, and statements at a glance, that are essential in each judgment. In
addition, we provide new evaluation metrics for the legal diagram visualization
by considering graph structures, textual similarities, and legal contents. We
conducted empirical studies on few-shot prompting and finetuning of large
language models
for generating legal diagrams and evaluated them with these metrics, including
legal content-based evaluation within 23 languages. Models trained with
LegalViz outperform existing models including GPTs, confirming the
effectiveness of our dataset.
comment: NAACL2025
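To make the target representation concrete, here is a hand-written example of
the kind of DOT output LegalViz-style models are trained to produce; the
entities and edge labels are invented for illustration.

    # Nodes are legal entities; edges are transactions or statements.
    dot_example = """
    digraph judgment {
        plaintiff [label="Company A (plaintiff)"]
        defendant [label="Company B (defendant)"]
        court     [label="District Court"]
        plaintiff -> defendant [label="claims breach of contract"]
        court -> defendant     [label="orders damages of 10,000 EUR"]
    }
    """
    with open("judgment.dot", "w") as f:
        f.write(dot_example)
    # Render with Graphviz: dot -Tpng judgment.dot -o judgment.png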
♻ ☆ Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
Sequence labeling remains a significant challenge in low-resource,
domain-specific scenarios, particularly for character-dense languages like
Chinese. Existing methods primarily focus on enhancing model comprehension and
improving data diversity to boost performance. However, these approaches still
struggle with inadequate model applicability and semantic distribution biases
in domain-specific contexts. To overcome these limitations, we propose a novel
framework that combines an LLM-based knowledge enhancement workflow with a
span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model.
Our workflow employs explanation prompts to generate precise contextual
interpretations of target entities, effectively mitigating semantic biases and
enriching the model's contextual understanding. The KnowFREE model further
integrates extension label features, enabling efficient nested entity
extraction without relying on external knowledge during inference. Experiments
on multiple Chinese domain-specific sequence labeling datasets demonstrate that
our approach achieves state-of-the-art performance, effectively addressing the
challenges posed by low-resource settings.
♻ ☆ Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Similarity between training examples is used to curate pretraining datasets
for language models by many methods -- for diversification and to select
examples similar to high-quality data. However, similarity is typically
measured with off-the-shelf embedding models that are generic or trained for
tasks such as retrieval. This paper introduces a framework to analyze the
suitability of embedding models specifically for data curation in the language
model pretraining setting. We quantify the correlation between similarity in
the embedding space to similarity in pretraining loss between different
training examples, and how diversifying in the embedding space affects
pretraining quality. We analyze a variety of embedding models in our framework,
with experiments using the Pile dataset for pretraining a 1.7B parameter
decoder-only language model. We find that the embedding models we consider are
all useful for pretraining data curation. Moreover, a simple approach of
averaging per-token embeddings proves to be surprisingly competitive with more
sophisticated embedding models -- likely because the latter are not designed
specifically for pretraining data curation. Indeed, we believe our analysis and
evaluation framework can serve as a foundation for the design of embedding
models that specifically reason about similarity in pretraining datasets.
comment: 14 pages
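One reading of the "averaging per-token embeddings" baseline is mean pooling
of an off-the-shelf encoder's contextual token embeddings; the sketch below
makes that concrete (the model choice and pooling details are assumptions).

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")

    @torch.no_grad()
    def mean_pooled_embedding(text: str) -> torch.Tensor:
        """Average the encoder's per-token embeddings, masking padding."""
        batch = tok(text, return_tensors="pt", truncation=True)
        hidden = enc(**batch).last_hidden_state        # (1, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1)   # (1, seq, 1)
        return (hidden * mask).sum(1) / mask.sum(1)    # (1, dim)

    a = mean_pooled_embedding("An article about astronomy.")
    b = mean_pooled_embedding("A blog post on telescopes.")
    print(torch.cosine_similarity(a, b).item())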
♻ ☆ Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties NAACL
A language can have different varieties. These varieties can affect the
performance of natural language processing (NLP) models, including large
language models (LLMs), which are often trained on data from widely spoken
varieties. This paper introduces a novel and cost-effective approach to
benchmark model performance across language varieties. We argue that
international online review platforms, such as Booking.com, can serve as
effective data sources for constructing datasets that capture comments in
different language varieties from similar real-world scenarios, like reviews
for the same hotel with the same rating using the same language (e.g., Mandarin
Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland
Mandarin). To prove this concept, we constructed a contextually aligned dataset
comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs
in a sentiment analysis task. Our results show that LLMs consistently
underperform in Taiwan Mandarin.
comment: Accepted by 2025 Annual Conference of the Nations of the Americas
Chapter of the Association for Computational Linguistics (NAACL), theme track
♻ ☆ Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment
Entity alignment (EA) aims at identifying equivalent entity pairs across
different knowledge graphs (KGs) that refer to the same real-world identity. To
systematically combat confirmation bias for pseudo-labeling-based entity
alignment, we propose a Unified Pseudo-Labeling framework for Entity Alignment
(UPL-EA) that explicitly eliminates pseudo-labeling errors to boost the
accuracy of entity alignment. UPL-EA consists of two complementary components:
(1) The Optimal Transport (OT)-based pseudo-labeling uses discrete OT modeling
as an effective means to enable more accurate determination of entity
correspondences across two KGs and to mitigate the adverse impact of erroneous
matches. A simple but highly effective criterion is further devised to derive
pseudo-labeled entity pairs that satisfy one-to-one correspondences at each
iteration. (2) The cross-iteration pseudo-label calibration operates across
multiple consecutive iterations to further improve the pseudo-labeling
precision rate by reducing the local pseudo-label selection variability with a
theoretical guarantee. The two components are respectively designed to
eliminate Type I and Type II pseudo-labeling errors identified through our
analysis. The calibrated pseudo-labels are thereafter used to augment prior
alignment seeds to reinforce subsequent model training for alignment inference.
The effectiveness of UPL-EA in eliminating pseudo-labeling errors is both
theoretically supported and experimentally validated. The experimental results
show that our approach achieves competitive performance with limited prior
alignment seeds.
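The one-to-one OT-based pseudo-labeling can be approximated with a standard
assignment solver on an embedding-distance cost matrix; in this sketch the
hard assignment plus a cost cutoff stand in for the paper's OT modeling and
selection criterion (the cutoff rule is an assumption).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def ot_pseudo_labels(emb_kg1: np.ndarray, emb_kg2: np.ndarray,
                         max_cost: float = 0.5) -> list[tuple[int, int]]:
        """One-to-one entity correspondences from an optimal assignment.

        The assignment problem on a cosine-distance matrix is the
        discrete-OT special case enforcing one-to-one matches; pairs
        costlier than `max_cost` are dropped as unreliable.
        """
        cost = 1.0 - emb_kg1 @ emb_kg2.T          # rows: KG1, cols: KG2
        rows, cols = linear_sum_assignment(cost)
        return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]

    rng = np.random.default_rng(0)
    e1 = rng.normal(size=(5, 8))
    e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
    e2 = e1[[1, 0, 2, 4, 3]] + 0.01 * rng.normal(size=(5, 8))
    print(ot_pseudo_labels(e1, e2))  # recovers the permutation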
♻ ☆ FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training
Recent advancements in large language models (LLMs) have shown promise in
medical applications such as disease diagnosis and treatment planning. However,
most existing medical LLMs struggle with the advanced reasoning required for
complex clinical scenarios, such as differential diagnosis or personalized
treatment suggestions. We propose FineMedLM-o1, which leverages high-quality
synthetic medical data and long-form reasoning data for Supervised Fine-Tuning
(SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and
deep reasoning capabilities. Additionally, we introduce Test-Time Training
(TTT) in the medical domain for the first time, facilitating domain adaptation
and ensuring reliable, accurate reasoning. Experimental results demonstrate
that FineMedLM-o1 achieves a 23% average performance improvement over prior
models on key medical benchmarks. Furthermore, the introduction of TTT provides
an additional 14% performance boost, highlighting its effectiveness in
enhancing medical reasoning capabilities. To support this process, we also
propose a novel method for synthesizing medical dialogue. Compared to other
open-source datasets, our dataset stands out as superior in both quality and
complexity. The project and data will be released on GitHub.
♻ ☆ A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Large Vision-Language Models (LVLMs), despite their recent success, are
hardly comprehensively tested for their cognitive abilities. Inspired by the
prevalent use of the Cookie Theft task in human cognitive tests, we propose a
novel evaluation benchmark to evaluate high-level cognitive abilities of LVLMs
using images with rich semantics. The benchmark consists of 251 images along
with comprehensive annotations. It defines eight reasoning capabilities and
comprises an image description task and a visual question answering task. Our
evaluation of well-known LVLMs shows that there is still a significant gap in
cognitive abilities between LVLMs and humans.
♻ ☆ VaiBot: Shuttle Between the Instructions and Parameters of Large Language Models
How to interact with LLMs through \emph{instructions} has been widely studied
by researchers. However, previous studies have treated the emergence of
instructions and the training of LLMs on task data as separate processes,
overlooking the inherent unity between the two. This paper proposes a neural
network framework, VaiBot, that integrates VAE and VIB, designed to uniformly
model, learn, and infer both deduction and induction tasks under LLMs. Through
experiments, we demonstrate that VaiBot performs on par with existing baseline
methods in terms of deductive capabilities while significantly surpassing them
in inductive capabilities. We also find that VaiBot can scale up using general
instruction-following data and exhibits excellent one-shot induction abilities.
We finally synergistically integrate the deductive and inductive processes of
VaiBot. Through T-SNE dimensionality reduction, we observe that its
inductive-deductive process significantly improves the distribution of training
parameters, enabling it to outperform baseline methods in inductive reasoning
tasks. The code and data for this paper can be found at
https://anonymous.4open.science/r/VaiBot-021F.
♻ ☆ Graph-based Retrieval Augmented Generation for Dynamic Few-shot Text Classification
Text classification is a fundamental task in data mining, pivotal to various
applications such as tabular understanding and recommendation. Although neural
network-based models, such as CNN and BERT, have demonstrated remarkable
performance in text classification, their effectiveness heavily relies on
abundant labeled training data. This dependency makes these models less
effective in dynamic few-shot text classification, where labeled data is
scarce, and new target labels frequently appear based on application needs.
Recently, large language models (LLMs) have shown promise due to their
extensive pretraining and contextual understanding ability. Current approaches
provide LLMs with text inputs, candidate labels, and additional side
information (e.g., descriptions) to classify texts. However, their
effectiveness is hindered by the increased input size and the noise introduced
through side information processing. To address these limitations, we propose a
graph-based online retrieval-augmented generation framework, namely GORAG, for
dynamic few-shot text classification. Rather than treating each input
independently, GORAG constructs and maintains a weighted graph by extracting
side information across all target texts. In this graph, text keywords and
labels are represented as nodes, with edges indicating the correlations between
them. To model these correlations, GORAG employs an edge weighting mechanism to
prioritize the importance and reliability of extracted information and
dynamically retrieves relevant context using a minimum-cost spanning tree
tailored for each text input. Empirical evaluations demonstrate that GORAG
outperforms existing approaches by providing more comprehensive and precise
contextual information.
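One way to realize a per-input minimum-cost tree over the keyword/label graph
is an approximate Steiner tree connecting the input's keywords; a toy networkx
sketch follows (graph contents and weights are invented, and the Steiner-tree
reading is an interpretation of the abstract).

    import networkx as nx
    from networkx.algorithms.approximation import steiner_tree

    # Keyword/label graph: lower edge weight = stronger, more reliable link.
    G = nx.Graph()
    G.add_edge("refund", "billing_issue", weight=0.2)
    G.add_edge("invoice", "billing_issue", weight=0.3)
    G.add_edge("delay", "shipping", weight=0.4)
    G.add_edge("refund", "shipping", weight=0.9)

    def retrieve_context(graph: nx.Graph, query_keywords: list[str]) -> set:
        """Connect the input's keywords with a minimum-cost tree and
        return the nodes (keywords and labels) it passes through."""
        terminals = [k for k in query_keywords if k in graph]
        tree = steiner_tree(graph, terminals, weight="weight")
        return set(tree.nodes)

    print(retrieve_context(G, ["refund", "invoice"]))
    # -> {'refund', 'billing_issue', 'invoice'}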
♻ ☆ Training Sparse Mixture Of Experts Text Embedding Models
Transformer-based text embedding models have improved their performance on
benchmarks like MIRACL and BEIR by increasing their parameter counts. However,
this scaling approach introduces significant deployment challenges, including
increased inference latency and memory usage. These challenges are particularly
severe in retrieval-augmented generation (RAG) applications, where large
models' increased memory requirements constrain dataset ingestion capacity, and
their higher latency directly impacts query-time performance. While causal
language models have addressed similar efficiency challenges using Mixture of
Experts (MoE) architectures, this approach has not been successfully adapted to
the general text embedding setting. In this paper, we introduce Nomic Embed v2,
the first general purpose MoE text embedding model. Our model outperforms
models in the same parameter class on both monolingual and multilingual
benchmarks while also maintaining competitive performance with models twice its
size. We open-source all code, models, and evaluation data to ensure full
reproducibility of our training pipeline at
https://github.com/nomic-ai/contrastors.
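To ground the architecture, below is a minimal sparse MoE feed-forward block
of the kind such models substitute for a dense FFN; the dimensions, expert
count, and top-k routing are illustrative, not Nomic Embed v2's configuration.

    import torch
    import torch.nn as nn

    class MoEFeedForward(nn.Module):
        """Router picks top-k experts per token; only those experts run."""
        def __init__(self, dim: int = 256, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                              nn.Linear(4 * dim, dim))
                for _ in range(n_experts)])
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
            gates = self.router(x).softmax(dim=-1)            # (tokens, experts)
            topv, topi = gates.topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = topi[:, slot] == e                 # tokens routed here
                    if mask.any():
                        out[mask] += topv[mask, slot, None] * expert(x[mask])
            return out

    print(MoEFeedForward()(torch.randn(10, 256)).shape)  # torch.Size([10, 256])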